In This Issue


This issue of Survey Methodology opens with a special section on Small Area Estimation. The 
three papers in this special section consider the problem of domain estimation from a variety of 
perspectives. I would like to give special thanks to Jon Rao for coordinating the editorial work for 
this special section. One or two other papers on this topic, which were not yet ready for publication, 
may also appear in a later issue. 

The first paper in the special section, by Singh, Gambino and Mantel, considers the problem of 
small area statistics from the perspective of survey design. They discuss the role of sample design 
features such as stratification, clustering and sample allocation in the production of small area 
statistics for both planned and unplanned domains. A short overview of current approaches to small 
area estimation is also included. The paper is followed by insightful comments by Fuller and Kalton 
and a response from the authors. 

The paper by Holt and Holmes presents a model based approach to small area estimation that 
does not ‘‘borrow strength’’ from other domains, and which may be used when auxiliary totals and 
means are not available. Estimates of model parameters are combined with design based estimates 
of means or totals of covariates. Using an example from market research it is shown that the method 
can lead to significant gains in efficiency of estimates for small domains. 

The last paper in the special section, by Singh, Mantel and Thomas, presents an empirical 
comparison of several different small area estimators using simulated sampling from a population 
of farms. It is shown that, in the context of repeated surveys, estimators based on time series models 
can perform better, with respect to both bias and mean squared error, than those based on models 
for a single time point. 

Kovar and Chen present results of a simulation study in which they investigated statistical properties 
of the jackknife approach to variance estimation of imputed data sets. Under this approach, the 
variance due to imputation is incorporated in the variance estimator. Real data sets, four different 
imputation methods, simple random sampling and a uniform nonresponse mechanism were used. 
Performance under a stratified multistage design and a non-uniform nonresponse mechanism was 
also studied. 

Tracy and Osahan propose ratio estimators associated with two sampling strategies for estimation 
of a population mean in overlapping clusters with unknown population size. While much work by 
several researchers is available on non-overlapping clusters in the literature, there are many practical 
sampling situations where one gets overlapping clusters. The first sampling strategy is an equal probability 
with replacement sampling scheme while the second strategy is an unequal probability sampling scheme. 

Prasad and Graham extend the ‘‘Random Group Method’’ for sampling with probability proportional
to size (PPS) to sampling over two occasions. They use for this purpose the information on a study 
variate observed on the first occasion to select the matched portion of the sample on the second occasion. 

Sitter and Skinner show how linear programming may be used to find an optimal sample design 
in the context of a multi-way stratification. Their approach is compared to existing methods both 
by illustrating the sampling schemes generated for specific examples and by evaluating mean squared 
errors. Variance estimation is also considered. 

Fuller, Loughin and Baker consider regression weighting in the presence of non-response. They 
exhibit conditions under which the regression estimator remains consistent in the presence of 
non-response, and discuss implications for the choice of regressor variables. The ideas are illustrated 
by application to the 1987-88 Nationwide Food Consumption Survey conducted by the Human 
Nutrition Information Service of the U.S. Department of Agriculture. 

The paper by Stasny, Toomey and First gives a description of a survey conducted in 1990 to 
estimate the rate of rural homelessness in Ohio. The possible magnitude of the bias of the estimator 
is investigated by simulating sampling from a variety of synthetic populations. It is found that the 
bias is likely to be small compared to the standard deviation. 
Issues and Strategies for Small Area Data


M.P. SINGH, J. GAMBINO and H.J. MANTEL¹


ABSTRACT 


This paper identifies some technical issues in the provision of small area data derived from censuses, administrative 
records and surveys. Although the issues are of a general nature, they are discussed in the context of programs at 
Statistics Canada. For survey-based estimates, the need for developing an overall strategy is stressed and salient 
features of survey design that have an impact on small area data are highlighted in the context of redesigning a 
household survey. A brief review of estimation methods with their strengths and weaknesses is also presented. 


KEY WORDS: Sample design strategy; Design estimates; Model estimates. 


1. INTRODUCTION 


For decades, administrative records and censuses were 
the main sources of data used for policy and planning for 
both large and small areas. These are still the richest source 
of statistical data at small area levels in most countries. 
During the forties and fifties, however, as the reliance on 
sample surveys increased, survey based estimates comple- 
mented the traditional sources because they provide more 
timely and cost efficient statistical data in a variety of 
subject matter fields. Although designed to provide reliable 
estimates primarily at larger area levels such as national 
and provincial, increasingly such surveys are being used 
to meet the growing demands for more timely estimates 
for various types and sizes of domains. No technical 
problem arises as long as these domains are large enough 
(e.g., age-sex groups, larger cities and sub-provincial 
regions) to yield estimates of acceptable reliability. If data 
are needed for small domains, however, particularly if 
such domains cut across design strata, special estimation 
problems arise and several methods have recently been 
proposed to deal with such problems. 

The main message of this paper is the need
to look at the problem of small area data in its entirety.
Small area needs should be recognized at the early stages 
of planning for large scale surveys. The sampling design 
should include special features that enable production of 
reliable small area data using design or model estimators. 
The handling of this growing challenge to statistical agencies 
at the estimation stage should be viewed as a last resort. 

In section 2, we discuss data needs and the three main 
sources of socio-economic data in the Canadian context, 
namely, the census, administrative records and surveys. 
Section 3 identifies some technical issues regarding the 
three sources of data and highlights the problems of 
quality measures and their interpretation. Then a need for
developing an overall strategy that includes the planning,
designing and estimation stages in the survey process is 
highlighted in section 4. Two aspects of the design, namely, 
clustering in a multi-stage sample design and sample 
allocation are discussed. In section 5, we present some 
sample design options being incorporated during the current 
redesign of the Canadian Labour Force Survey, the largest 
monthly household survey conducted by Statistics Canada, 
with a view to enhancing the survey capacity to provide 
better quality small area data. The purpose of section 6 
is to review the many different approaches to estimation 
for small areas. We also suggest some new estimators and 
provide comments on the strengths and weaknesses of 
various domain estimators. A cautious approach towards 
the use of model estimators is stressed. 


2. INFORMATION NEEDS AND DATA SOURCES


As the country’s national statistical agency, Statistics 
Canada plays an integral role in the functioning of Cana- 
dian society. While guaranteeing the confidentiality of 
individual respondents’ data, the agency’s information 
describes the economic and social conditions of the country 
and its people. Its economic, demographic, social and 
institutional statistics programs produce reliable data on 
many aspects of life at the national, provincial, and sub- 
provincial levels for use by federal and provincial govern- 
ments, private institutions, academics and the media. With 
increases in the planning, administration and monitoring 
of social and fiscal programs at local levels, there has been 
increasing demand for more and better-quality data at 
these levels. Three major sources of social, socio-economic 
and demographic data with emphasis on small area 
statistics are briefly discussed below. 


¹ M.P. Singh, J. Gambino and H.J. Mantel, Statistics Canada, 16th Floor, R.H. Coats Building, Tunney’s Pasture, Ottawa, Ontario, Canada
Census of Population: The quinquennial census of
population provides benchmark data and serves as the 
richest source of information, available every five years, 
for small areas and for various characteristics/domains/ 
target groups of policy interest such as ethnic minorities, 
disabled persons, youth and aboriginal peoples. 

Administrative Records: Administrative records are an 
increasingly important source of statistical data. These are 
extensively used in the demographic field by statistical 
agencies to produce local area estimates (Schmidt 1952, 
Verma and Basavarajappa 1987). In certain areas, such 
as vital statistics, administrative records are the only source 
of information for production of statistics at various levels 
of aggregation. In others, the relative merits of adminis- 
trative records compared to censuses or surveys as data 
sources in terms of timeliness and quality of data deter- 
mine the manner and the extent to which these data sources 
are used. In addition to direct tabulations, administrative 
records are used in a number of programs as a source of 
supplementary information for use in improving the 
quality of survey-based estimates. They are also being used 
in the construction of sampling frames for conducting 
surveys. Examples at Statistics Canada include the Business 
Register and the Address Register of residential dwellings. 

Like the census of population, administrative records 
are very rich in geographical detail, making them a useful 
source of information for small area statistics. They are 
available more frequently and, due to recent technological 
advances, they are becoming a more cost-effective data 
source. However, administrative data are based on defi- 
nitions made for programmatic rather than statistical 
purposes and their content is limited. Details of a Statistics 
Canada program for integration and development of an 
administrative records system to produce statistical outputs 
are given by Brackstone (1987a, 1987b). Experiences in 
the use of administrative records in other countries are 
included in the conference proceedings edited by Coombs 
and Singh (1987). 

Household Surveys Program: Household surveys have 
long been an important source of economic and social 
statistics at Statistics Canada. Surveys under this program
may be placed in three groups, namely, (i) the Labour 
Force Survey, (ii) Special Surveys and Supplementary 
Survey Programs and (iii) Longitudinal/Cyclical Surveys. 
These surveys are briefly introduced below indicating the 
scope for small area statistics in general. 

Starting as a quarterly survey in 1945, the Canadian 
Labour Force Survey (LFS) became a monthly survey in 
1952. The information provided by the survey has expanded 
considerably over the years and currently it provides a rich 
and detailed picture of the Canadian labour market. In 
addition to providing national and provincial estimates 
the survey regularly releases estimates for subprovincial 
areas. Regular estimates of standard labour market indicators
are also in great demand for small areas such as
Federal Electoral Districts, Census Divisions and Canada
Employment Centres. These estimates are used by both 
federal and provincial governments in monitoring programs 
and allocating funds and other resources among various 
political and administrative jurisdictions. 

Because of cost considerations, the LFS is heavily used 
as a vehicle for conducting ad hoc and periodic surveys 
at the national and provincial levels in the form of supple- 
mentary or special surveys. In the case of supplements, the 
LFS respondents themselves are asked additional questions, 
whereas for special surveys a different set of households 
is selected using the LFS frame. Both special and supple- 
mentary surveys are usually sponsored by other govern- 
ment departments and are conducted on a cost-recovery 
basis. For these surveys, the demands for small area 
statistics differ greatly from survey to survey, and generally 
the demands seem to be less pressing than those from the 
LFS itself. 

Statistics Canada conducts a General Social Survey 
(GSS) annually to serve, in a modest way, the growing data 
needs on topics of current social policy interest. The GSS 
program (Norris and Paton 1991) consists of five survey 
cycles, each covering a different core topic, repeated every 
five years. Because of the limited size of sample (10,000 
households nationally) the focus of the GSS is on estimates 
at the national level and on analytical statistics. 

Longitudinal/panel surveys are new in the Canadian con- 
text. Statistics Canada has started two longitudinal surveys 
that will enrich the household survey program greatly, 
namely, the Survey on Labour and Income Dynamics and 
the National Population Health Survey. Both are large scale 
panel surveys and they are already creating expectations 
for data at sub-provincial and local area levels. 


3. ISSUES IN DOMAIN ESTIMATION 


There are numerous policy and technical issues that 
need to be addressed in the provision of small area 
statistics. The seriousness of these issues may vary from 
agency to agency and from one application to the next 
within the same agency depending on data quality and 
release policies. These issues are relevant for national and 
provincial estimates, but they assume higher significance 
in the context of small area statistics. As Brackstone 
(1987a) notes ‘‘on the issue of small area data evaluation, 
it is worth noting that error in small area estimates may 
be more apparent to users than error in national 
aggregates... at a local area level, there will be critics 
quick to point out deficiencies... it is true that for small 
areas, where estimation is more difficult, scrutiny of 
estimates is also more intensive’’. Several research and 
developmental studies on small area estimation are 
described in two volumes, one edited by Platek et al. 
(1987), and the other by Platek and Singh (1986). For a 
recent overview of small area estimation techniques
currently being used in United States federal statistical 
programs see U.S. Statistical Policy Office (1993). 

Use of Administrative Records: Federal and provincial 
government policies are the prime factors that influence 
the supply as well as the demand for small area data in 
most situations. On the supply side, government program 
driven administrative records contain a wealth of statistical 
information that can be used to produce local area data. 
Examples of files being used in the Canadian context are: 
Family Allowance, Unemployment Insurance, Income 
Tax, Health, Education, Old Age Security. Income-related 
statistics are produced at the local area level on a regular 
basis. Any change in government policy and associated 
programs can have immediate impact, for better or worse, 
on the coverage, availability, timeliness or quality of 
statistics derived from the corresponding administrative 
records. On the demand side, as noted earlier, govern- 
ments need local area data for planning, implementing and 
monitoring their policies. 

Conceptual issues: Quite frequently, conceptual and 
definitional issues in a data series are confounded with 
sampling and estimation problems. For example, consider 
the Unemployment Insurance (UI) system in Canada. UI 
regulations stipulate different qualification and requalifi-
cation periods depending on the unemployment rate in a
given region such that regions with higher unemployment 
rates require shorter qualifying periods of continuous 
employment. The estimates of regional unemployment 
rates derived from the LFS are used in determining the 
eligibility for an individual to receive benefits. These local 
area estimates are thus continually under close scrutiny by 
the public and the media. Such scrutiny however refers 
more often to conceptual issues rather than estimation 
issues per se; aspects such as the treatment in the survey 
questionnaire of discouraged workers, lay-offs and job 
search methods are questioned. 

Use of Models and Related Quality Measures: Domain 
estimates are produced for virtually all large scale surveys, 
and as long as design estimators, i.e., approximately
design-unbiased estimators, are of acceptable quality, no
problem arises. We consider two classes of design esti- 
mators. Following Schaible (1992), direct estimators refer 
to estimators which use values of the study variable only 
for the time period of interest and only from units in the 
domain (e.g., the regression estimator with slope estimated 
using only data from the domain). Such estimators may, 
and often do, use information on one or more auxiliary 
variables from other domains or other time periods, and 
are design unbiased or approximately so. The second class 
of design estimators, modified direct estimators, may use 
information from other domains on both the auxiliary and 
the study variable but still retain the property of design 
unbiasedness or approximate unbiasedness (e.g., the 
regression estimator with slope estimated using the whole
sample). There is a growing literature on indirect (or model)
estimators, that is, estimators which use information on 
both the study and auxiliary variables from outside the 
domain and/or the time period of interest without any 
reference to their design unbiasedness properties. 
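
The difference between a direct and a modified direct
regression estimator can be sketched in a few lines of Python.
The sketch assumes simple random sampling and a domain
population mean of the auxiliary variable that is known from
another source. The data and the function names (slope,
regression_estimate) are invented for illustration and are not
taken from Schaible (1992).

```python
# Sketch only: simple random sampling is assumed and all data are invented.

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def regression_estimate(x_dom, y_dom, x_pop_mean_dom, b):
    """Regression estimate of a domain mean for a given slope b."""
    mean_x = sum(x_dom) / len(x_dom)
    mean_y = sum(y_dom) / len(y_dom)
    return mean_y + b * (x_pop_mean_dom - mean_x)

# Sample units inside the domain of interest (hypothetical values).
x_dom, y_dom = [1.0, 2.0, 4.0], [2.1, 3.9, 8.3]
# Sample units from the other domains (hypothetical values).
x_other, y_other = [3.0, 5.0, 6.0, 8.0], [5.8, 10.4, 11.9, 16.2]
x_pop_mean_dom = 3.0  # assumed known auxiliary mean for the domain

# Direct: the slope uses study-variable data from the domain only.
direct = regression_estimate(x_dom, y_dom, x_pop_mean_dom, slope(x_dom, y_dom))

# Modified direct: the slope uses study-variable data from the whole sample.
b_all = slope(x_dom + x_other, y_dom + y_other)
modified_direct = regression_estimate(x_dom, y_dom, x_pop_mean_dom, b_all)

print(f"direct = {direct:.3f}, modified direct = {modified_direct:.3f}")
```

In both cases the domain sample means anchor the estimate; only
the source of the slope changes, which is why the modified
version can remain approximately design unbiased.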

Most producers and users of survey data are accustomed 
to design estimators and the corresponding design-based 
inferences. They interpret the data in the context of repeated 
samples selected using a given probability sampling design, 
and use estimated design-based cvs (coefficients of variation:
square root of design variance divided by the design
estimate) as the measures of data quality. For situations
where either domains are too small or the sampling design 
did not foresee production of small area estimates, the 
design estimates may lead to large design cvs and model 
estimates may be the only choice if the survey-based 
estimates have to be provided for individual domains. 
A major challenge for statisticians is how to estimate, 
compare and explain to the users the relative precision of 
estimates from a survey that produces a large number of 
estimates at the national, subnational and large and small 
domain levels, most using design estimators but a few 
using model estimators. The model-based cvs (square root
of design variance of model estimate divided by the model
estimate) may convey a completely different message and
may be several times lower than the corresponding design-
based cvs for the same small area and, in many cases, lower
than the design-based cvs for much larger areas.
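
As a small numerical illustration of this contrast, the
following Python sketch computes the two cvs for one small
area. All figures are hypothetical and chosen only to show how
a model-based cv can be several times lower; note that a cv
computed this way says nothing about the design bias of the
model estimate.

```python
import math

def cv(estimate, design_variance):
    """Coefficient of variation: sqrt(design variance) / estimate."""
    return math.sqrt(design_variance) / estimate

# Hypothetical figures for one small area.
design_cv = cv(estimate=1000.0, design_variance=40000.0)
model_cv = cv(estimate=1050.0, design_variance=2500.0)

print(f"design-based cv = {design_cv:.1%}")
print(f"model-based cv  = {model_cv:.1%}")
print(f"ratio           = {design_cv / model_cv:.1f}")
```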

For model estimators, it is usually straightforward to 
derive expressions for the corresponding mean square 
errors (i.e., design variance + square of the design bias). 
Estimation of these expressions, with an adequate degree 
of reliability, is a different matter. If we follow the argu- 
ment that the data (e.g., sample size) for such domains are 
inadequate for producing design estimates, it is unlikely 
that they would be adequate for producing design estimates 
of the corresponding variances and biases. As the estimation 
of bias is relatively more difficult, some authors seek 
design consistent model estimators, implying perhaps that 
bias can be ignored. However, if the sample size within the 
domain is sufficiently large to make the model estimator 
consistent, then the design estimator itself should give 
reliable estimates for the domain. For model estimators, 
suggestions have been made to use estimates of average 
mean square error computed over all domains. As the need 
for estimates for different domains usually arises because 
these domains are thought to be different from each other, 
a challenging task is to explain why estimates from all such 
domains are given the same degree of reliability. Another 
possibility is to construct indirect model-based estimates 
of the variance and bias of the model estimators for indi- 
vidual domains. Finding suitable methods of estimating 
mean square error for individual domains should be a 
research priority. Another serious concern for survey prac- 
titioners is how to guard against model failures. This 
suggests a need for research into model validation for
complex survey situations. Further, for model estimators 
that use data on study variables for periods other than the 
time period of interest, estimates of change over different 
time periods would be of questionable quality; see Schaible 
(1992). Also, model estimators that borrow strength from 
other domains in the larger area will suffer a similar 
drawback when comparing differences in the two domains 
within the large area. 
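
The decomposition of mean square error into design variance
plus squared design bias can be checked by simulated repeated
sampling. The Python sketch below does this for a direct and a
simple synthetic estimator. The population, the sample sizes
and the choice of synthetic estimator (the estimated overall
mean used for the domain) are all assumptions made for
illustration. The sketch can compute the true bias only because
the whole population is known, which is exactly what is
unavailable in a real survey.

```python
import random
import statistics

rng = random.Random(12345)

# Hypothetical population: a small domain that differs from the rest.
domain = [rng.gauss(12.0, 3.0) for _ in range(50)]
rest = [rng.gauss(10.0, 3.0) for _ in range(950)]
n_total = len(domain) + len(rest)
true_domain_mean = statistics.fmean(domain)

def one_sample():
    """Draw one stratified sample; return (direct, synthetic) estimates."""
    s_dom = rng.sample(domain, 3)
    s_rest = rng.sample(rest, 47)
    direct = statistics.fmean(s_dom)
    # Synthetic: use the estimated overall mean for the domain.
    synthetic = (len(domain) * statistics.fmean(s_dom)
                 + len(rest) * statistics.fmean(s_rest)) / n_total
    return direct, synthetic

def mse_components(estimates, true_value):
    """Return (variance, bias, mse) over the repeated samples."""
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return variance, bias, mse

draws = [one_sample() for _ in range(2000)]
direct_parts = mse_components([d for d, _ in draws], true_domain_mean)
synthetic_parts = mse_components([s for _, s in draws], true_domain_mean)

for name, (variance, bias, mse) in [("direct", direct_parts),
                                    ("synthetic", synthetic_parts)]:
    print(f"{name:9s} variance={variance:.3f} bias={bias:+.3f} mse={mse:.3f}")
```

In this setup the synthetic estimator has a much smaller
variance than the direct one, but a bias that the direct
estimator does not have.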

Issue of Privacy: In order to construct rich data bases 
for providing small area statistics, it is sometimes necessary 
to combine census, survey and/or administrative records. 
This necessitates linkage of records obtained from different 
sources. However, given the public’s concern about 
privacy, record linkages should be carried out only after 
careful examination of all their implications. Under the 
Statistics Act, Statistics Canada may have access to admin- 
istrative records of other departments for statistical pur- 
poses. But even for statistical purposes, as Fellegi (1987) 
notes, ‘‘we should have rigorous and auditable review 
procedures to ensure that we only carry out record linkage 
where the resulting privacy invasion is clearly outweighed 
by the public good from the new statistical information’’. 


4. NEED FOR AN OVERALL STRATEGY 


Even though large scale surveys are designed primarily 
for national and provincial estimates, it is rare that the 
estimates from such surveys relate only to the national/ 
provincial populations as a whole. That is, invariably, such 
surveys are used to produce estimates for various cross- 
classified domains and in some cases for areal domains 
(e.g., subprovincial) as well. In many cases, no special 
attention is paid to achieving a desired level of precision 
at the domain level either at the design or the estimation 
stage as long as the reliability is (believed to be) within 
reasonable limits. Problems arise when the cross-classified 
domain refers to a rare subpopulation or when the areal 
domain refers to a small area in which case either no esti- 
mates are possible/available or the estimates are of ques- 
tionable quality. In a number of cases, this may happen 
simply because not enough attention was paid to these 
needs at the start of the survey planning process. If small 
area data needs are to be served using survey data then 
there is a need to develop an overall strategy that involves 
careful attention to meeting these needs at the planning, 
sample design and estimation stages of the survey process. 
For discussion of the design and estimation aspects, we will 
classify domains into the following two types: 


Planned domains: In sampling terms these are individual 
strata or groups of strata for which desired samples have 
been planned. In the Canadian context these are typically 
subprovincial regions, such as Economic Regions, Unem-
ployment Insurance Regions, and Health Planning Regions.
In other cases, such domains could be larger counties,
districts or similar subprovincial regions. 


Unplanned domains: These are areas that were not iden- 
tified at the time of design and thus may cut across design 
strata. Such domains can be of any size and they may 
create special estimation problems. 


Planning: As noted earlier, the data demands from 
continuing periodic surveys such as the LFS are considerably
higher than from ad hoc surveys. In the case of
periodic surveys that are redesigned every five or ten years, 
a suitable strategy can be developed during survey rede- 
signs, since, in such cases, statistical agencies are usually 
in a much better position to project future small area data 
needs based on past demands. For ad hoc surveys, 
designers should include the establishment of such needs 
as an integral part of objective setting for the survey. Thus, 
in both cases, survey designers should establish the desired 
degree of precision, not only for national and provincial 
level estimates, but also for the domains of interest. 

The first step of a strategy, in terms of the provision 
of small area data, will depend on the extent to which 
domains are identified in advance so that they can be treated 
as planned domains at the time of the design (or redesign) 
of the survey. If budgetary considerations do not permit 
reliable estimates for certain very small domains, then the 
option of either collapsing domains, pooling estimates over 
different surveys or not providing the estimates at all should 
be given serious consideration by survey designers in discus- 
sions with the survey sponsors. Some domains cannot be 
determined in advance. These unplanned domains should 
be handled through special estimation methods. 

Sample design: In practice, it is rare that a design is 
optimal either for the national or provincial levels or for 
a single subject matter of interest. Usually varying degrees 
of compromise are introduced at different stages of 
sampling and the data collection process to satisfy theo- 
retical and operational constraints. Depending on the data 
needs, estimates for domains should also form an integral 
part of this compromise. We will discuss two ways of taking 
small area data needs into account at the design stage, 
namely, sample allocation and the degree of clustering of 
the sample. 

Allocation Strategy: In general, an optimum allocation 
strategy for national level estimates allocates samples to 
provinces approximately in proportion to their population. 
The reliability of estimates for smaller provinces in such 
cases suffers. Therefore a compromise allocation is usually 
preferred. There are different ways in which this compromise 
can be achieved depending on the emphasis placed on sub- 
national estimates. Small reductions in sample sizes for larger 
provinces usually have little effect on the reliability of data
for such provinces (or the national level data) but the 
corresponding sample increase in smaller provinces has 
significant impact on the reliability of their data. 
The same principle holds for planned domains within
the provinces. This is because optimum allocations in most 
situations are flat and the designers can exploit this feature 
by reallocating sample from the larger areas to planned 
domains that are smaller in size. 
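
The effect of a compromise allocation can be illustrated with
a short Python sketch. It compares proportional allocation with
a square-root ("power") allocation, which is one possible
compromise chosen here for illustration and not necessarily the
one used in any actual survey. The province populations, the
total sample size and the proportion being estimated are
hypothetical, and simple random sampling within provinces
(ignoring the finite population correction) is assumed.

```python
import math

def power_allocation(populations, n, q):
    """Allocate n in proportion to population**q (q=1 is proportional)."""
    sizes = [pop ** q for pop in populations]
    total = sum(sizes)
    return [n * s / total for s in sizes]

def cv_of_proportion(p, n_h):
    """cv of an estimated proportion under SRS, ignoring the fpc."""
    return math.sqrt((1.0 - p) / (p * n_h))

def national_cv(populations, allocation, p):
    """cv of the population-weighted national proportion."""
    total = sum(populations)
    var = sum((pop / total) ** 2 * p * (1.0 - p) / n_h
              for pop, n_h in zip(populations, allocation))
    return math.sqrt(var) / p

# Hypothetical province populations, total sample size and proportion.
populations = [10_000_000, 3_000_000, 500_000, 100_000]
n, p = 10_000, 0.10

proportional = power_allocation(populations, n, q=1.0)
compromise = power_allocation(populations, n, q=0.5)

for pop, n_prop, n_comp in zip(populations, proportional, compromise):
    print(f"N={pop:>10,d}"
          f"  proportional n={n_prop:6.0f} cv={cv_of_proportion(p, n_prop):.3f}"
          f"  compromise n={n_comp:6.0f} cv={cv_of_proportion(p, n_comp):.3f}")

print(f"national cv: proportional={national_cv(populations, proportional, p):.4f}"
      f"  compromise={national_cv(populations, compromise, p):.4f}")
```

With these made-up figures the smallest province gains a great
deal of precision while the largest province and the national
estimate lose comparatively little, which is the flatness
property described above.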

Clustering: Large scale household surveys usually 
involve stratified multistage designs with relatively large 
primary sampling units in order to make the design cost- 
efficient for national and provincial statistics. Such designs 
are thus highly clustered and, therefore, detrimental to the 
production of statistics for unplanned areal domains in the 
sense that, due to chance, some domains may be sample- 
rich while others may have no sample at all. Given the 
importance of domain estimates, attempts should be made 
to minimize the clustering in the sample. The following 
factors are important in this context: choice of frame, 
choice of sampling units and their sizes, number and size 
of strata and stages of sampling. The goal should be to 
make the design effects as low as possible given the oper- 
ational constraints. 

Estimation: No matter how much attention is paid to 
domain estimates at the early stages of planning and 
designing a particular survey, there will always be some 
smaller domains for which special estimation methods will 
be required for producing adequate estimates. Recently, 
synthetic estimators, which borrow strength from domains 
that resemble the domain of interest, have attracted a good 
deal of attention. However, since synthetic estimators are 
very sensitive to the assumption that domains resemble 
each other, even a small departure from the assumption 
can make the design bias high and put their use in question. 
Probability samplers, conscious of design bias, have sug- 
gested combinations of direct and synthetic estimators, 
with a view to addressing the design bias problem while 
trying to retain the strengths of the synthetic estimator. 
Empirical Bayes and similar techniques have been used to 
assign a weight to each component in the combined esti- 
mators. A brief review of these developments is given in 
section 6 on estimation. 
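
A combined estimator of this kind can be sketched as follows in
Python. The weight used here is the simple MSE-minimizing
weight, which is a stand-in chosen for illustration rather than
the empirical Bayes weights mentioned above. It assumes that
the direct estimate is design unbiased, that the errors of the
two components are uncorrelated, and that the variance and mean
square error are known, whereas in practice they would have to
be estimated. The function name and the numbers are
hypothetical.

```python
def composite_estimate(direct, synthetic, var_direct, mse_synthetic):
    """Weighted combination of a direct and a synthetic estimate.

    The weight on the direct estimate minimizes the MSE of the
    combination when the direct estimate is unbiased and the two
    errors are uncorrelated (both are assumptions of this sketch).
    """
    weight = mse_synthetic / (var_direct + mse_synthetic)
    return weight * direct + (1.0 - weight) * synthetic, weight

# Hypothetical values: an imprecise direct estimate and a more
# stable (but possibly biased) synthetic estimate.
estimate, weight = composite_estimate(
    direct=10.0, synthetic=12.0, var_direct=4.0, mse_synthetic=1.0)

print(f"weight on direct = {weight:.2f}, composite = {estimate:.2f}")
```

As the variance of the direct estimate shrinks, the weight
moves towards one and the composite reverts to the direct
estimate.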


5. SAMPLE DESIGN CONSIDERATIONS 


5.1 Introduction 


The small area problem is usually thought of as one to 
be dealt with via estimation. However, as was noted in the 
previous section, there are opportunities to be exploited 
at the survey design stage. This section uses the Canadian 
Labour Force Survey (LFS) to illustrate this. 

The current LFS design: The Canadian Labour Force 
Survey is a monthly survey of 59,000 households which 
are selected in several stages using various methods. The 
ultimate sampling unit, the household, remains in the 
sample for six months once it is selected and is then
replaced. Higher stage units (primary sampling units
(PSU), clusters) also rotate periodically. Each of Canada’s 
ten provinces is divided into economic regions (ER) which 
the LFS further divides into self-representing areas 
(medium and large cities) and non-self-representing areas 
(the rest of the ER). Stratification and sample selection 
take place within these areas, and the number of stages of 
sampling as well as the units of sampling differ between 
these two types of area. For example, in areas outside 
cities, there are three stages of sampling, whereas there are 
only two in the cities. For a detailed description of the 
current LFS design, refer to Singh et al. (1990).


5.2 Sampling Stages and Sampling Units 


Area frames are usually associated with clustered 
sampling, i.e., the first-stage units of selection are typically 
land areas containing a number of second-stage units. If 
a list of the second-stage units becomes available, then 
sampling directly from the list becomes possible, leading 
to a less clustered sample. This will result not only in 
improved estimates (due to lower design effects) but also 
in better small area estimates for unplanned domains. The 
latter holds since, by spreading the sample more evenly, 
it is more likely that an unplanned areal domain will 
contain some selected units. In contrast, in a clustered 
design we are often faced with a situation where one 
domain has sufficient sample because it happens to contain 
sampled clusters while a similar domain happens to have 
too few or no sampled clusters to produce good estimates. 
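
The contrast between the two designs can be illustrated with a small simulation. The sketch below is not taken from the LFS redesign work; the population layout (100 clusters of 50 dwellings), the domain (five whole clusters) and the function name are all assumptions chosen only to make the point. Both designs select 200 dwellings, so the expected domain sample size is the same; what differs is how much the realized size varies and how often the domain receives no sample at all.

```python
import random
import statistics

def simulate_domain_sample_sizes(n_clusters=100, cluster_size=50,
                                 domain_clusters=5, clusters_sampled=10,
                                 per_cluster=20, reps=2000, seed=1):
    """Realized sample size in one domain under two designs with the same
    total sample size (clusters_sampled * per_cluster dwellings)."""
    rng = random.Random(seed)
    n_units = n_clusters * cluster_size
    n_sample = clusters_sampled * per_cluster
    # Hypothetical domain: the dwellings in the first `domain_clusters` clusters.
    domain_units = domain_clusters * cluster_size
    list_frame, clustered = [], []
    for _ in range(reps):
        # List frame: dwellings are drawn directly (unclustered sample).
        drawn = rng.sample(range(n_units), n_sample)
        list_frame.append(sum(1 for u in drawn if u < domain_units))
        # Area frame: clusters are drawn first, then dwellings within them.
        chosen = rng.sample(range(n_clusters), clusters_sampled)
        hits = sum(1 for c in chosen if c < domain_clusters)
        clustered.append(per_cluster * hits)
    return list_frame, clustered

list_frame, clustered = simulate_domain_sample_sizes()
print("share of replicates with no sample in the domain:",
      list_frame.count(0) / len(list_frame),
      clustered.count(0) / len(clustered))
print("variance of the domain sample size:",
      statistics.pvariance(list_frame), statistics.pvariance(clustered))
```

Under these assumed settings the clustered design leaves the domain empty in a large share of replicates, while the list-frame design almost never does.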

To reduce clustering in the LFS we investigated two 
options: (i) the possibility of replacing the area frame (with 
its two stage design) in the larger cities with a list frame 
using the Address Register and (ii) reducing the sampling 
stages in rural areas and smaller urban centres. The Address 
Register, created to improve the coverage of the 1991 
Canadian census (Swain, Drew, Lafrance and Lance 1992), 
consists of a list of addresses, telephone numbers and 
geographical information for dwellings by census enumer- 
ation area (EA). One option involved the selection of a 
stratified simple random sample of dwellings from the 
Address Register frame. This sample could then be sup- 
plemented with a sample selected from a growth frame 
which comprises a set of dwellings that are not in the
post-censal Address Register. Handling of growth became the
major stumbling block in pursuing option (i), as no
cost-efficient method could be devised and tested in time for
the current redesign. However, an updating strategy for
the post-censal Address Register is still being investigated 
for future censuses and surveys. 

With regard to option (ii), in keeping with the idea that 
less clustering is better for small area estimates, changes 
in the units and reduction in the stages of sampling were 
investigated for the areas outside the cities. Due to the 
changes that have taken place in data collection techniques, 
namely, from face-to-face interviewing to telephone and
computer assisted interviewing, the cost-variance analyses 
from the past are no longer relevant. More than 80 percent 
of LFS interviews are now conducted by telephone. With 
the increase in telephone interviewing and the resulting 
decrease in travel, it became feasible in almost all cases to 
eliminate the current PSU stage and to sample EAs directly. 


5.3 Stratification 


One approach to stratification, similar in spirit to the 
above discussion on PSU size, is to replace large strata by 
many small ones. The hope is that a redefined domain or 
an unplanned domain will contain mostly complete strata. 
This will make the sample size in the domain more stable. 

There may be several overlapping areas for which esti- 
mates are required. For example, each Canadian province 
is partitioned into both Economic Regions (ER) and 
Unemployment Insurance regions (UIR). One way to deal 
with this situation is to treat all the areas created by the 
intersections of the partitions as strata. In the Canadian 
case, for example, the 71 ERs and 61 UIRs yield 133 inter- 
sections, a manageable number. In some cases, however, 
the number of intersections may be too large to handle 
effectively. In addition, some of the intersections may have 
very small populations, making them unusable as strata. 
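
A minimal sketch of this idea follows. It assumes that each small building-block area carries an ER code, a UIR code and a population count; the function name, the codes and the population threshold are hypothetical and are not taken from the LFS.

```python
from collections import defaultdict

def intersection_strata(blocks, min_population):
    """blocks: iterable of (er, uir, population) for small building-block areas.
    Returns two dicts keyed by (er, uir): usable strata and those too small."""
    totals = defaultdict(int)
    for er, uir, population in blocks:
        totals[(er, uir)] += population
    usable = {k: v for k, v in totals.items() if v >= min_population}
    too_small = {k: v for k, v in totals.items() if v < min_population}
    return usable, too_small

# Hypothetical building blocks: (ER code, UIR code, population)
blocks = [("ER1", "UIR1", 40000), ("ER1", "UIR1", 25000),
          ("ER1", "UIR2", 900), ("ER2", "UIR2", 52000)]
usable, too_small = intersection_strata(blocks, min_population=5000)
print(sorted(usable))      # intersections large enough to serve as strata
print(sorted(too_small))   # intersections that would have to be collapsed
```

Only non-empty intersections appear in the result, which is why the number of strata can be far smaller than the product of the two partition sizes.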

By combining decreased clustering with smaller strata, 
we hope to have a design which is better able to meet small 
area needs. For example, the design should provide more 
flexibility in satisfying both ER and UIR requirements 
efficiently and in dealing with future changes in the defi- 
nition of regions. 


5.4 Allocation 


If the definitions of small areas are known in advance, 
we may be able to treat them as planned domains and take 
them into account when designing the survey. The survey 
designer may endeavour to allocate sufficient sample in 
each small area to make the production of reliable estimates 
feasible. For large surveys such as the Canadian Labour 
Force Survey, this approach can, at least in theory, make 
the production of a great many small area estimates fea- 
sible. With a monthly sample of 59,000 households, and 
assuming that, say, 100 households per month are needed 
to produce reliable quarterly estimates, the country can 
be divided into about 600 non-overlapping areas, each 
guaranteed to have sufficient sample. Unions of such areas 
will also have enough sample to produce reliable monthly 
estimates. 

Various sample allocation strategies are possible. In a
top-down approach, once a provincial sample size is deter-
mined, the sample is allocated among the sub-provincial
regions. However, it may turn out that it is not possible to
satisfy the requirements for the reliability of sub-provincial
estimates for the given provincial sample size. In a bottom-
up strategy, the sample would be allocated to sub-provincial
regions first in such a way that reliability objectives for 
each region are satisfied. As a result, we would expect 
comparable sample sizes in each sub-provincial region. 
This approach may result in a provincial sample size that 
is bigger than the one specified in the top-down approach. 
Regardless of which of the two strategies is used, adjust- 
ments to the initial allocations will usually be required. The 
resulting allocation will likely resemble a compromise 
between proportional allocation and equal allocation. In 
practice, the survey designer must perform a complex 
juggling act among provincial reliability requirements, 
sub-provincial requirements for one or more sets of 
regions, total survey costs and in-the-field details. 

The approach taken in the current LFS redesign may 
be useful in other surveys as well. The sample was allocated 
in two steps: first, a core sample of 42,000 households was 
allocated to produce good estimates at the national and 
provincial levels; then the remaining sample was allocated 
to produce the best possible sub-provincial estimates. The 
resulting compromise allocation will produce reliable 
estimates for almost all planned domains. The compromise
resulted in only minor losses at the provincial level and
substantial gains at the sub-provincial level. For example,
the expected CVs for ‘unemployed’ for Ontario and
Quebec are 3.2 and 3.0 per cent, respectively, under the
compromise allocation, instead of 2.8 and 2.6 under an
allocation optimized for the provincial level. The
corresponding figures for Canada are 1.51 and 1.36.
Optimizing for the provincial level yields CVs as high as
17.7 per cent for UI regions. With the compromise
allocation, the worst case is 9.4 per cent.
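
The two-step idea can be sketched as follows. This is not the procedure actually used in the LFS redesign: the proportional rule for the core sample, the rule that hands the remainder to whichever region currently has the smallest sample, the step size and all region names and sizes are assumptions made for illustration. The result lies between proportional and equal allocation.

```python
import heapq

def compromise_allocation(pop_sizes, total, core, step=10):
    """Allocate `core` units proportionally to population size, then give the
    remaining `total - core` units, `step` at a time, to the region that
    currently has the smallest sample."""
    big_n = sum(pop_sizes.values())
    # Step 1: proportional allocation of the core sample.
    alloc = {region: core * n / big_n for region, n in pop_sizes.items()}
    # Step 2: raise the smallest regions first with the remaining sample.
    heap = [(n, region) for region, n in alloc.items()]
    heapq.heapify(heap)
    remaining = total - core
    while remaining > 0:
        n, region = heapq.heappop(heap)
        add = min(step, remaining)
        heapq.heappush(heap, (n + add, region))
        remaining -= add
    return {region: n for n, region in heap}

# Hypothetical regions and population sizes
pops = {"A": 500000, "B": 300000, "C": 150000, "D": 50000}
print(compromise_allocation(pops, total=5900, core=4200))
```

Under the rough assumption that a region's CV is proportional to the inverse square root of its sample size, raising the smallest regions first reduces the worst-case CV, while the largest regions keep their proportional share of the core sample.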

Sample redistribution: There is usually some scope for 
moving sample from one area to another. For example, 
reducing the sample size by 1,000 households in a large 
province and making a corresponding increase in a small 
province will cause a marginal deterioration in the quality 
of provincial estimates in the former but will improve the 
estimates in the latter significantly. Similar movements of 
sample can be attempted within province. 


5.5 Other Considerations 


Change in definitions of small areas: Survey designers 
are faced with the fact that the definitions of planned 
domains may change during the life of a design and they 
may then have to treat the new domains as unplanned 
domains. For example, it is quite possible that the defini- 
tions of Unemployment Insurance Regions will change 
two or three years after the new LFS design is introduced 
in 1995. To deal with this at the design stage, the best that 
the survey designer can do is to choose as building blocks 
areas which are standard (e.g., census-defined areas whose 
definitions are fairly stable) and hope that the redefined 
regions are unions of these standard areas. This is the 
approach that was taken in the current LFS redesign. 
An alternative is to adopt an update strategy. This
entails a reselection of units, doing it in such a way that 
the overlap between the originally selected units and the 
newly selected ones is maximized. By taking this approach, 
the number of new units that have to be listed is minimized. 
This also minimizes other field disruptions such as the need 
to hire new interviewers. 


6. ESTIMATION 


The purpose of this section is to review some of the 
different approaches to estimation of totals for small 
areas. No attempt is made to provide an exhaustive review; 
the discussion indicates the trend of developments in small 
area estimation research. For a detailed review, see the 
recent paper by Ghosh and Rao (1993). To facilitate this 
review we will classify small area estimation methods into 
two types. This is just one of many possible classification 
schemes. The first class of estimators we call design esti- 
mators, i.e., (approximately) design unbiased estimators, 
which includes direct and modified direct estimators. As 
noted earlier, design estimators are often unsatisfactory, 
having a large variance due to small sample sizes (or even 
no sample at all) in the small areas. The second class we 
call indirect (or model) estimators, and it includes synthetic 
and combined estimators. Some of these estimators are 
compared empirically in an earlier version of this paper 
by Singh, Gambino and Mantel (1992). 


6.1 Design Estimators 


Direct Estimators: Direct small area estimators are 
based on survey data from only the small area, perhaps 
making use of some auxiliary data from census or adminis- 
trative sources in addition to the survey data. The simplest 
direct estimator of a total is the expansion estimator, 


$$\hat{Y}_{exp,a} = \sum_{i \in s_a} w_i y_i, \tag{6.1}$$

where $s_a$ is the part of the sample in small area $a$ and $w_i$
is the survey weight for unit $i$. This estimator is unbiased;
however, it may have high variability due to the random
sample size in area $a$.

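If the population size $N_a$ is known then a post-
stratified estimator,

$$\hat{Y}_{pst,a} = N_a \frac{\sum_{i \in s_a} w_i y_i}{\sum_{i \in s_a} w_i} = N_a \frac{\hat{Y}_{exp,a}}{\hat{N}_{exp,a}} = N_a \bar{y}_a, \tag{6.2}$$

may be used, where $\hat{N}_{exp,a} = \sum_{i \in s_a} w_i$. This estimator is
more stable than the expansion estimator; however, there
may be some ratio estimation bias in complex surveys.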
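
The expansion and post-stratified estimators of a small area total can be sketched as below. The function names and the toy data are hypothetical; the expansion estimator is the weighted sum of $y$ over the sampled units in the area, and the post-stratified estimator rescales it by the known population size over the sum of the weights.

```python
def expansion_estimate(y, w, area, a):
    """Weighted sum of y over the sampled units that fall in area a."""
    return sum(wi * yi for yi, wi, ai in zip(y, w, area) if ai == a)

def poststratified_estimate(y, w, area, a, pop_size):
    """Known population size of area a times the weighted sample mean in a."""
    weight_sum = sum(wi for wi, ai in zip(w, area) if ai == a)
    if weight_sum == 0:
        raise ValueError("no sample in area; a direct estimate is not possible")
    return pop_size * expansion_estimate(y, w, area, a) / weight_sum

# Hypothetical data: y is a 0/1 indicator, w the survey weights.
y = [1, 0, 1, 1]
w = [10, 10, 20, 20]
area = ["a", "a", "a", "b"]
print(expansion_estimate(y, w, area, "a"))
print(poststratified_estimate(y, w, area, "a", pop_size=50))
```

The explicit error for an empty area reflects the point made in the text: with no sample in the small area a direct estimator simply does not exist.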


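If the sampling scheme is stratified and the $N_{h,a}$ are
known, where $N_{h,a}$ is the population size in stratum $h$
and small area $a$, an alternative post-stratified esti-
mator is

$$\hat{Y}_{st,pst,a} = \sum_h N_{h,a} \frac{\sum_{i \in s_{h,a}} w_i y_i}{\sum_{i \in s_{h,a}} w_i} = \sum_h N_{h,a} \frac{\hat{Y}_{h,exp,a}}{\hat{N}_{h,exp,a}} = \sum_h N_{h,a} \bar{y}_{h,a},$$

where $s_{h,a}$ is the part of the sample in stratum $h$ and small
area $a$. The strata may also be post-strata instead of design strata.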

Ratio estimation is similar to post-stratified estimation,
the difference being that another auxiliary variable is used
in place of the population counts $N_a$ and $N_{h,a}$. For
example, if $x$ is a covariate for which the small area totals,
$X_a$, or the stratum small area totals, $X_{h,a}$, are known then
we may define the ratio estimators

$$\hat{Y}_{r,a} = X_a \hat{R}_a \quad \text{and} \quad \hat{Y}_{st,r,a} = \sum_h X_{h,a} \hat{R}_{h,a}, \tag{6.3}$$

where $\hat{R}_a = \bar{y}_a / \bar{x}_a$ is an estimate of the ratio $Y_a/X_a$
and $\hat{R}_{h,a} = \bar{y}_{h,a} / \bar{x}_{h,a}$.


A regression estimator attempts to account for dif-
ferences between small area subpopulation and subsample
values of the covariates via an estimated regression rela-
tionship between the variate of interest, $y$, and the
covariates, $x$. An advantage of regression type estimation
is that it is easily extended to vector covariates. The
estimator is given by

$$\hat{Y}_{reg,a} = \hat{Y}_a + \hat{\beta}_a (X_a - \hat{X}_a), \tag{6.4}$$

where $\hat{Y}_a$ may be an expansion or post-stratified estimator,
$\hat{X}_a$ must be calculated in the same way, and

$$\hat{\beta}_a = \sum_{i \in s_a} v_i^{-1} w_i y_i x_i' \Big( \sum_{i \in s_a} v_i^{-1} w_i x_i x_i' \Big)^{-1},$$

where the $v_i$ are given weights for the regression. Note that
$\hat{\beta}_a = \hat{R}_a$ when $x$ is scalar and $v_i = x_i$. When $\hat{Y}_a$ and
$\hat{X}_a$ are expansion estimators this estimator is also called
the generalized regression estimator. Approximate design
unbiasedness of this estimator follows from that of $\hat{Y}_a$ and $\hat{X}_a$.

As with the ratio type estimators, regression type 
estimation may also be applied within design strata or 
post-strata. 
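
For a single scalar covariate the regression estimator can be sketched as below. The estimate is the expansion estimate of the $y$ total plus an estimated slope times the gap between the known covariate total and its expansion estimate. The function name and data are hypothetical, and the choice of expansion estimators for both totals is an assumption. Setting the regression weights equal to $x$ reproduces the ratio estimator, which serves as a check.

```python
def regression_estimate(y, x, w, known_x_total, v=None):
    """Scalar-covariate regression estimator for the sampled units of one area.
    y, x, w: values, covariate and survey weights of the units in the area.
    v: regression weights (defaults to 1 for every unit)."""
    if v is None:
        v = [1.0] * len(y)
    y_hat = sum(wi * yi for wi, yi in zip(w, y))   # expansion estimate of Y_a
    x_hat = sum(wi * xi for wi, xi in zip(w, x))   # expansion estimate of X_a
    num = sum(wi * yi * xi / vi for wi, yi, xi, vi in zip(w, y, x, v))
    den = sum(wi * xi * xi / vi for wi, xi, vi in zip(w, x, v))
    beta = num / den                               # estimated slope for the area
    return y_hat + beta * (known_x_total - x_hat)

y = [2, 4, 9]
x = [1, 2, 3]
w = [10, 10, 10]
print(regression_estimate(y, x, w, known_x_total=70))        # v_i = 1
print(regression_estimate(y, x, w, known_x_total=70, v=x))   # v_i = x_i, ratio form
```

With `v=x` the slope equals the ratio of the two expansion estimates, so the result is the known covariate total times that ratio.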


Modified Direct Estimators: Modified direct estimators
may use survey data from outside the domain; however,
they remain approximately design unbiased. By a modified
direct estimator we mean a direct estimator with a syn-
thetic adjustment for model bias; since the adjustment
would have approximately zero expectation with respect
to the design, the modified estimator is approximately
design unbiased if the direct estimator is. An example is
obtained by replacing $\hat{\beta}_a$ in (6.4) by a synthetic estimator
$\hat{\beta} = \sum_{i \in s} v_i^{-1} w_i y_i x_i' \{ \sum_{i \in s} v_i^{-1} w_i x_i x_i' \}^{-1}$; we will denote
this estimator by $\hat{Y}_{sreg,a}$. $\hat{\beta}$ would generally be more stable
than $\hat{\beta}_a$; the choice between them would depend on the
size of the variance of $\hat{\beta}_a$ relative to the variation in the
$\beta_a$'s over areas $a$. A compromise is to take a weighted
average $\lambda_a \hat{\beta}_a + (1 - \lambda_a) \hat{\beta}$, where $\lambda_a$ is suitably chosen;
options for the choice of $\lambda_a$ are discussed under combined
estimators in Section 6.2. A second example is obtained
by replacing $\hat{\beta}_a$ in (6.4) by $\hat{R} = \hat{Y}/\hat{X}$; note that $\hat{R}$ is a
special case of $\hat{\beta}$ where $x$ is scalar and $v_i = x_i$.


6.2 Indirect Estimators 


Synthetic Estimators: Synthetic estimation methods are 
based on an assumption that the small area is similar in some 
sense to another area, often a larger area which contains 
it. Estimates for the other area would generally be more 
reliable than those for the small area. The resulting synthetic 
estimator would then have small variance, though it may 
be badly biased if the underlying assumption is violated. 

One of the simplest synthetic estimators arises from the
assumption that the small area mean is equal to the overall
mean. This leads to the mean synthetic estimator

$$\hat{Y}_{syn,m,a} = N_a \sum_{i \in s} w_i y_i \Big/ \sum_{i \in s} w_i = N_a \bar{y}. \tag{6.5}$$

A more common synthetic estimator is based on stratifica-
tion or post-stratification,

$$\hat{Y}_{syn,st,m,a} = \sum_h N_{h,a} \bar{y}_h,$$

where $\bar{y}_h$ is the weighted sample mean in stratum $h$.
As with direct estimators, ratio synthetic estimation
may be based on other auxiliary data besides the popula-
tion counts $N_a$ or $N_{h,a}$. For example, the common ratio
synthetic estimators based on a covariate $x$ are defined as

$$\hat{Y}_{syn,r,a} = X_a \frac{\hat{Y}}{\hat{X}} \quad \text{and} \quad \hat{Y}_{syn,st,r,a} = \sum_h X_{h,a} \frac{\hat{Y}_h}{\hat{X}_h}, \tag{6.6}$$

where $\hat{Y} = \sum_{i \in s} w_i y_i$ is the expansion estimator of the
population total for $y$ and $\hat{Y}_h = \sum_{i \in s_h} w_i y_i$. $\hat{X}$ and $\hat{X}_h$
are similarly defined. These estimators have been studied
by Gonzalez (1973), Gonzalez and Waksberg (1973) and
Ghangurde and Singh (1977, 1978), among others.
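
A sketch of the two ratio synthetic estimators follows. Note that the whole sample, not just the part falling in the small area, supplies the ratio; only the known covariate totals are specific to the area. The function names, the stratum labels and the data are hypothetical.

```python
def ratio_synthetic(y, x, w, area_x_total):
    """Known covariate total for the area times the overall estimated ratio."""
    y_hat = sum(wi * yi for wi, yi in zip(w, y))   # expansion estimate of Y
    x_hat = sum(wi * xi for wi, xi in zip(w, x))   # expansion estimate of X
    return area_x_total * y_hat / x_hat

def stratified_ratio_synthetic(y, x, w, stratum, area_x_totals):
    """area_x_totals: dict mapping stratum h to the known covariate total for
    the part of stratum h that lies in the small area."""
    total = 0.0
    for h, x_ha in area_x_totals.items():
        units = [i for i, s in enumerate(stratum) if s == h]
        y_h = sum(w[i] * y[i] for i in units)
        x_h = sum(w[i] * x[i] for i in units)
        total += x_ha * y_h / x_h
    return total

y = [2, 4, 9, 12]
x = [1, 2, 3, 4]
w = [5, 5, 5, 5]
stratum = ["h1", "h1", "h2", "h2"]
print(ratio_synthetic(y, x, w, area_x_total=12))
print(stratified_ratio_synthetic(y, x, w, stratum, {"h1": 4, "h2": 2}))
```

Because the ratio is estimated from the full sample, these estimates exist even for an area with no sampled units, at the price of the synthetic bias discussed above.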
Singh and Tessier (1976) suggested an alternative ratio
synthetic estimator, using $X$ instead of $\hat{X}$, defined as


$$\hat{Y}^{*}_{syn,r,a} = X_a \frac{\hat{Y}}{X}. \tag{6.7}$$


Both $\hat{Y}_{syn,r,a}$ of (6.6) and $\hat{Y}^{*}_{syn,r,a}$ of (6.7) have the same
synthetic bias, and the ratio bias in $\hat{Y}_{syn,r,a}$ will be negligible
for large samples. The choice between these two estimators depends
on $\rho$, the correlation of $\hat{Y}$ and $\hat{X}$. It can be shown that
for large samples $V(\hat{Y}_{syn,r,a}) < V(\hat{Y}^{*}_{syn,r,a})$ if $\rho >
0.5 c_x/c_y$, where $c_x$ and $c_y$ are the coefficients of variation
of $\hat{X}$ and $\hat{Y}$, respectively. In most cases, when $\rho$ is high
or the population is skewed, $\hat{Y}_{syn,r,a}$ would be preferred;
however, when $c_x$ is high and the correlation is only
moderate, $\hat{Y}^{*}_{syn,r,a}$ may be the better choice.
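
The condition on $\rho$ can be obtained from the usual first-order (Taylor) approximation to the variance of a ratio. The following sketch is not taken from the paper; it treats $X_a$ as fixed, assumes large samples and $c_x > 0$, and compares the estimator that divides by the estimated total $\hat{X}$ with the one that divides by the known total $X$.

```latex
% Assumptions: first-order Taylor approximation, X_a fixed, c_x > 0.
% \hat{Y}, \hat{X}: expansion estimators of the population totals Y and X.
\begin{align*}
V\!\left(X_a \hat{Y}/X\right) &= (X_a/X)^2\, V(\hat{Y}) = (X_a/X)^2\, Y^2 c_y^2, \\
V\!\left(X_a \hat{Y}/\hat{X}\right) &\approx (X_a/X)^2\, Y^2 \left(c_y^2 + c_x^2 - 2\rho\, c_x c_y\right), \\
\text{so}\quad V\!\left(X_a \hat{Y}/\hat{X}\right) < V\!\left(X_a \hat{Y}/X\right)
 &\iff c_x^2 - 2\rho\, c_x c_y < 0 \iff \rho > \tfrac{1}{2}\, c_x / c_y .
\end{align*}
```

The threshold rises with $c_x$, which is why a highly variable covariate with only moderate correlation favours dividing by the known total.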


In some situations information on a second auxiliary
variable ($z$) in addition to $x$ may be available. Then a
bivariate ratio synthetic estimator may be constructed:

$$\hat{Y}_{syn,br,a} = \gamma_a X_a \frac{\hat{Y}}{\hat{X}} + (1 - \gamma_a) Z_a \frac{\hat{Y}}{\hat{Z}}, \tag{6.8}$$

where $\gamma_a$ is suitably chosen. Extensions to a multivariate
ratio synthetic estimator may be considered following
Olkin (1958).


Regression synthetic estimation is similar to ratio
synthetic,

$$\hat{Y}_{syn,reg,a} = \hat{\beta} X_a, \qquad \hat{\beta} = \sum_{i \in s} v_i^{-1} w_i y_i x_i' \Big\{ \sum_{i \in s} v_i^{-1} w_i x_i x_i' \Big\}^{-1}. \tag{6.9}$$

Again, regression synthetic estimation may also be applied
within design strata or post-strata. Royall (1979) suggested
a slight variation, $\hat{Y}_{syn,roy,a} = \sum_{i \in s_a} y_i + \hat{\beta} (X_a - \sum_{i \in s_a} x_i)$,
where the sum of $y$-values for only units not included in
the sample is estimated synthetically.


Remark: The examples of modified direct estimators
presented in Section 6.1 can also be considered to be
ratio or regression synthetic estimators with a design-
based adjustment to correct for bias. For example, we
can write $\hat{Y}_a + \hat{R}(X_a - \hat{X}_a) = \hat{Y}_{syn,r,a} - (\hat{R}\hat{X}_a - \hat{Y}_a)$, where
$\hat{R}\hat{X}_a - \hat{Y}_a$ is an estimate of the bias of $\hat{Y}_{syn,r,a}$.
$\hat{Y}_{sreg,a}$ can also be written as the Royall estimator,
$\hat{Y}_{syn,roy,a}$, with a design-based adjustment for bias.


Purcell and Kish (1980) discuss another type of synthetic 
estimation which they call SPREE (structure preserving 
estimation) for small area estimation of frequency data. 
Detailed historical counts, perhaps from a census, are 
combined with less detailed current survey estimates to 
produce detailed estimates of current counts. The assump- 
tion here is that certain relationships among the detailed 
counts are stable over time. 


Combined Estimators: By a combined estimator we
mean a weighted average of a design estimator and a
synthetic estimator,

$$\hat{Y}_{com,a} = \lambda_a \hat{Y}_{des,a} + (1 - \lambda_a) \hat{Y}_{syn,a}, \tag{6.10}$$

where $\lambda_a$ is suitably chosen. The aim here is to balance the
potential bias of the synthetic estimator against the insta-
bility of the design estimator. There are three broad
approaches which may be used to define the weights $\lambda_a$ in
(6.10); they may be fixed in advance, sample size depen-
dent, or data dependent.

The first and simplest approach to weighting is to fix 
the weights in advance, for example, to take a simple 
average. However, this does not make any allowance for 
the actual observed reliability of the design estimator. For
some realized samples the design estimator for small area 
a is more reliable than for other realized samples. The 
weight given to the design estimator should reflect this. 

The second general approach to weighting of the design
and synthetic parts is called sample size dependent, in
which the weights are functions of the ratio $\hat{N}_a/N_a$,
where $\hat{N}_a$ is the direct estimate of the population size $N_a$.
Another possibility, not considered here, is to base the
weights on the realized sample values of a covariate $x$; for
example, the weight could be a function of $\hat{X}_{des,a}/X_a$ or
of $S^2_{x,a}/\sigma^2_{x,a}$, where $S^2_{x,a}$ is the realized variance of $\hat{X}_{des,a}$,
conditional on $\hat{N}_a$ or some other relevant aspect of the
realized sample, and $\sigma^2_{x,a}$ is the unconditional variance
of $\hat{X}_{des,a}$.

Some specific estimators in this class have been proposed
earlier. Drew, Singh, and Choudhry (1982) proposed the
sample size dependent estimator

$$\hat{Y}_{ssd,r,a} = \lambda_a \hat{Y}_{r,a} + (1 - \lambda_a) \hat{Y}_{syn,r,a}, \tag{6.11a}$$

where

$$\lambda_a = \begin{cases} 1 & \text{if } \hat{N}_a \ge \delta N_a, \\ \hat{N}_a / (\delta N_a) & \text{otherwise,} \end{cases} \tag{6.11b}$$

and $\delta$ is subjectively chosen to control the contribution of
the synthetic component. Särndal (1984) suggested

$$\hat{Y}_{ssd,reg,a} = \lambda_a \hat{Y}_{reg,a} + (1 - \lambda_a) \hat{Y}_{syn,reg,a}, \tag{6.12}$$

where $\lambda_a = \hat{N}_a/N_a$. Rao (1986) suggested a modifica-
tion to this in which $\lambda_a$ would be taken to be 1 whenever
$\hat{N}_a \ge N_a$. Särndal and Hidiroglou (1989) refined Rao's
suggestion by taking $\lambda_a = (\hat{N}_a/N_a)^{h-1}$ when $\hat{N}_a < N_a$,
where $h$ is chosen judgementally to control the contribu-
tion of the synthetic component.
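
The sample size dependent weights described above can be sketched as simple functions of the estimated and known population sizes of the area. The function names are hypothetical, and the numerical values in the example are invented; only the weight rules themselves follow the text.

```python
def weight_dsc(n_hat, n_pop, delta):
    """Drew, Singh and Choudhry: full weight on the direct estimator once the
    estimated size reaches delta times the known size."""
    return 1.0 if n_hat >= delta * n_pop else n_hat / (delta * n_pop)

def weight_sarndal_hidiroglou(n_hat, n_pop, h):
    """Weight 1 when the estimated size reaches the known size, otherwise
    (n_hat / n_pop) ** (h - 1)."""
    return 1.0 if n_hat >= n_pop else (n_hat / n_pop) ** (h - 1)

def combined_estimate(lam, direct, synthetic):
    """Weighted average of a direct and a synthetic estimate."""
    return lam * direct + (1.0 - lam) * synthetic

# Hypothetical area: the realized sample represents half of the known size.
lam = weight_dsc(n_hat=50, n_pop=100, delta=2 / 3)
print(lam, combined_estimate(lam, direct=100.0, synthetic=200.0))
```

Smaller values of `delta` put full weight on the direct estimator more often, which limits the exposure to synthetic bias; the weights do not depend on the variate of interest, so they are computed once per area.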

It is the bias of the synthetic component that is of 
concern when using these sample size dependent estimators 
in practice. The weight associated with the synthetic 
component should be such that the bias is kept within 
reasonable limits. For example, the sample size dependent 
estimator of Drew, Singh and Choudhry (1982), with 
generalized regression estimation replacing the ratio 
estimation and $\delta = 2/3$, is currently used in the Canadian
Labour Force Survey to produce domain estimates. For 
a majority of domains the weight attached to the synthetic 
component is zero as the direct estimator itself provides 
the required degree of reliability. For other domains the 
weight attached to the synthetic component is about 10% 
on average and never exceeds 20%. Depending on the risk 
of bias that one is willing to take, $\delta$ may lie in the range
[2/3, 3/2] for most practical situations.

The third approach to weighting we call data dependent. 
The optimal weights for combining two estimators generally 
depend on the mean squared errors of the estimators and
their covariance. These quantities would generally be
unknown but may be estimated from the data. For our 
combined estimators this would usually require some 
modelling of the bias of the synthetic part. An early and 
well known example of this approach is due to Fay and 
Herriot (1979). They model the biases of the synthetic 
estimators for the small areas as independent random 
effects with an unknown but fixed variance. To be more 
specific, if $\hat{Y}_{des,a}$ is the design estimator then they consider
the model $Y_a = X_a \beta + \alpha_a$ and $\hat{Y}_{des,a} = Y_a + \varepsilon_a$, where
$\alpha_a \sim (0, \sigma^2)$, $\varepsilon_a \sim (0, v_a^2)$, and $\alpha_a$ and $\varepsilon_a$ are independent
and uncorrelated over $a$; $\sigma^2$ is unknown and the $v_a^2$ are assumed
known (in practice they would need to be estimated). For
a given value of $\sigma^2$ the optimal weights for combining
$\hat{Y}_{des,a}$ and $X_a \hat{\beta}$ can be calculated. An estimate of $\sigma^2$ is
obtained by the method of fitting constants and substituted
into the optimal weights. Some protection against model
mis-specification is obtained by truncating the resulting
estimate if it deviates from the direct estimate by more than
a specified multiple of $v_a$. Schaible (1979) and Battese
and Fuller (1981) also consider empirically estimated
optimal weights $\lambda_a$ in (6.12) based on similar random
effects models for the small area totals.
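
A simplified sketch of the weighting step is given below. It is not Fay and Herriot's procedure in full: the model variance is taken as given rather than estimated by fitting constants, a single scalar covariate with no intercept is assumed, and the function and argument names are hypothetical. For a given model variance, the weight on the direct estimate is the model variance divided by the sum of the model and sampling variances, and the result is truncated to within a chosen multiple of the sampling standard error.

```python
def fay_herriot_combine(direct, x, v2, sigma2, c=2.0):
    """direct: direct estimates; x: scalar covariate; v2: sampling variances;
    sigma2: model variance (assumed known here); c: truncation multiple.
    Requires sigma2 + v2[a] > 0 for every area."""
    # Weighted least squares slope, weights 1 / (sigma2 + v2).
    weights = [1.0 / (sigma2 + v) for v in v2]
    beta = (sum(w * xi * yi for w, xi, yi in zip(weights, x, direct))
            / sum(w * xi * xi for w, xi in zip(weights, x)))
    estimates = []
    for yi, xi, v in zip(direct, x, v2):
        lam = sigma2 / (sigma2 + v)               # weight on the direct estimate
        est = lam * yi + (1.0 - lam) * beta * xi  # combine direct and synthetic
        half_width = c * v ** 0.5                 # allowed deviation from direct
        estimates.append(min(max(est, yi - half_width), yi + half_width))
    return estimates

print(fay_herriot_combine(direct=[12.0, 18.0], x=[1.0, 2.0],
                          v2=[4.0, 4.0], sigma2=4.0))
```

Areas with a small sampling variance relative to the model variance keep most of their direct estimate, while unstable direct estimates are pulled towards the regression prediction.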

Prasad and Rao (1990) provide an estimator of the 
mean square error of the Fay-Herriot estimator which 
makes allowance for the estimation of the variance com- 
ponents. Kott (1989) proposes a design consistent estimator 
of the mean square error, but finds it to be very unstable. 

Another alternative is to use historical data to calculate 
the weights; this has the advantage that the weights may 
be more stable than if they are estimated from current 
survey data; however, there is an underlying assumption 
that the optimal weights are stable over time. 


Remark: In sample size dependent estimation the 
weights are allowed to depend on the observed size of 
the subsample s,, but not on the values of the variate 
of interest. This non-dependence of the weights on the 
variate of interest has advantages and disadvantages. 
An advantage is that the same weights would be used 
for estimation of totals for all variates of interest; they 
need to be calculated only once. More importantly, 
the estimate of the sum of two variables is the sum of 
the estimates of the two variables. A disadvantage is 
that the weights do not directly take account of either 
the reliability of the design estimator for the variate 
of interest or the likely magnitude of the bias of the 
synthetic estimator. 


Combining data over time: For repeated surveys pooling 
of data over survey occasions to increase the reliability of 
estimates is a common practice. Depending on the rotation
pattern used for such surveys, significant gains in relia- 
bility can be achieved. This pooling or averaging over time 
is thus of particular interest in the context of domain 
estimation where reliability is usually low. For domain 
estimation in the Canadian Labour Force Survey it is
normal practice to use a sample size dependent estimator 
based on three month average estimates of employed and 
unemployed. Due to the six month rotation scheme used, 
as noted earlier, averaging over three months increases the 
sample size by one third. If samples completely overlap 
between periods then averaging does not result in any gain 
in efficiency. For other rotation patterns the sample size 
for domain estimates could be more than doubled through 
this process. There is, however, a conceptual problem with 
pooled estimates, in that such estimates refer to an average 
of the parameter of interest (e.g., unemployment) over a 
period of, say, three months. 
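
The gain from pooling can be checked with a short calculation. The sketch below assumes a rotation scheme in which a fixed fraction of the sample is replaced each month and households never return; it counts distinct households only and ignores the correlation between repeated interviews of the same household, so it is a rough guide rather than an effective sample size. The function name is hypothetical.

```python
from fractions import Fraction

def pooled_sample_factor(months_pooled, months_in_sample):
    """Distinct households in a pooled sample, relative to one month's sample,
    when 1/months_in_sample of the sample is replaced each month."""
    if months_pooled < 1 or months_in_sample < 1:
        raise ValueError("both arguments must be positive integers")
    # The first month contributes a full sample; each later month adds only
    # the newly rotated-in fraction.
    return 1 + Fraction(months_pooled - 1, months_in_sample)

print(pooled_sample_factor(3, 6))   # six-month rotation, three-month average
print(pooled_sample_factor(3, 1))   # a completely new sample every month
```

With a six-month rotation the three-month pooled sample contains four thirds as many distinct households as a single month, which is the one-third increase cited in the text.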

In composite estimation the current design estimator 
is combined with the composite estimator for the previous 
period, updated by an estimate of change based on the 
common sample. This idea was used, though not in the 
context of small area estimation, by Jessen (1942), and 
Patterson (1950), among others. Binder and Hidiroglou 
(1988) provide a review. The weights for the combination 
are typically estimates of the optimal weights under the 
assumption that these weights are time stationary. These 
data dependent weights have the disadvantage that they 
lead to inconsistency of estimates for different charac- 
teristics and their sums. 
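
The recursion behind composite estimation can be sketched as follows. The sketch uses a single fixed weight, whereas the text describes weights estimated from the data; the function name, the weight and the series are hypothetical. Each period's composite estimate averages the current direct estimate with the previous composite estimate carried forward by the estimated change from the common sample.

```python
def composite_series(direct, change, k):
    """direct[t]: direct estimate for period t.
    change[t]: estimate of the change from period t-1 to t based on the
    common sample (change[0] is not used).
    k: weight given to the updated previous composite estimate."""
    composite = [direct[0]]            # no earlier period to borrow from
    for t in range(1, len(direct)):
        updated_previous = composite[-1] + change[t]
        composite.append((1 - k) * direct[t] + k * updated_previous)
    return composite

print(composite_series(direct=[100.0, 110.0, 108.0],
                       change=[None, 6.0, -1.0], k=0.5))
```

Setting `k` to zero returns the direct estimates unchanged, which makes the contribution of the earlier periods easy to isolate.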

A recent development in small area estimation tech- 
niques is the use of time series methods for periodic 
surveys. The relationship between parameters of interest 
for different time periods is modelled and this model is 
exploited to improve the efficiency of the estimates for the 
current occasion. In most cases some allowance must also 
be made, through modelling or otherwise, for the non- 
independence of samples for different survey occasions 
due to the sample rotation scheme. Some references for 
this time series approach are Choudhry and Rao (1989), 
Pfeffermann and Burck (1990), Singh, Mantel and Thomas 
(1994) and Singh and Mantel (1991). All of these are 
generalizations of the Fay-Herriot model which allow the 
regression parameters, small area effects, and survey 
errors to evolve over time according to various time series 
models. The vector of small area estimates that results 
from this approach can be written as a weighted average 
of the vector of design estimates and a vector of synthetic 
estimates which are based on past data and the current 
values of covariates; however, the matrix of weights would 
not generally be diagonal so that the estimator for any 
single small area would generally depend also on the design 
estimates and synthetic estimates for other small areas. 


7. CONCLUSION 


To produce adequate survey-based domain estimates 
that are timely and up to date, sample designers must face 
several challenging tasks. The first is to convince the
sponsors/program managers that some small area data
needs cannot be met as a by-product of a system designed 
optimally for national/sub-national estimates. Significant 
gains, which may vary from survey to survey, can be 
achieved at the domain level at a marginal reduction in 
reliability at higher levels. There is a need to develop an 
overall strategy that incorporates desired reliability for the 
planned domains as well as for higher levels through 
compromise allocations, and reduced clustering to help 
improve estimates for unplanned domains. It should be 
noted that many of the planned domains at design time 
may become unplanned (revised) over time in the context 
of continuous surveys. 

The overall strategy should also include consideration 
of both design estimators for larger domains and model 
estimators for small domains. A model estimator should 
be preferred over a design estimator only if its mean square
error (design variance + bias$^2$) is estimable and it is suffi-
ciently smaller than the corresponding variance of the
design estimator. We should have estimates of mean 
square error for each of the individual domains. An option 
that statistical agencies can exercise is to pool similar 
domains or pool estimates over different time periods for 
the same domain. They may even suppress estimates for 
some domains on account of data reliability or privacy 
concerns. 

The second challenging task for statisticians is to explain 
to users the different types of measures of reliability for 
different sets of estimates from the same survey. It is 
hoped that with more research on model validation and 
better estimates of mean square errors, designers will get 
more confidence in using model estimators for small 
domains. In the meantime model estimators should be 
used with caution even if they have significantly smaller 
coefficients of variation. 

Censuses, supplemented by data from administrative 
records, are likely to remain the primary source of small 
area socio-economic data, especially for countries having 
a quinquennial census of population and housing. Also, 
concerns about conceptual issues in data from
administrative records are likely to
continue until statistical agencies are given an opportunity
to influence the development of the forms used to collect
such data. Until then, this immensely rich data source
cannot be fully exploited for statistical purposes, and
especially not for domain estimation.
COMMENT


W.A. FULLER¹


The authors are to be congratulated on an excellent 
description of the design and estimation considerations 
associated with domains. The authors discuss estimation 
for planned domains, particularly situations in which 
domain membership can be identified in the frame, and 
estimation for unplanned domains including domains for 
which the domain membership cannot be determined from 
the frame. This is a fine contribution to the growing 
literature on domain estimation. 

The authors give a particularly good description of the 
planning, data collection, and processing activities associ- 
ated with surveys conducted by Statistics Canada. Included 
are the traditional design problems of balancing needs for 
domain estimation with desire for efficiency at higher 
levels, the importance of confidentiality in using adminis- 
trative records in constructing domain estimates, and the 
importance of definitional compatibility in attempting to 
combine information from different sources. 

The importance of considering domain estimation at the 
design stage is very well taken and is a point often ignored 
by authors concentrating on small area estimation. As the 
authors emphasize, careful design can often enable one to 
construct estimates for domains in a direct and design con- 
sistent manner. I am sure that those actually designing surveys 
have considered the importance of clustering when designing 
surveys that will be used for domain estimation, but it is 
pleasant to see an explicit discussion. 

The authors describe several types of estimators for 
domains. Their classification emphasizes the number of 
alternatives available to the practitioner. It is possible to 
use the theoretical mean square errors to provide infor- 
mation on the relative merits of the estimators. As an 
example of such a comparison, assume a simple random 
sample of size n selected from a population divided into 
K domains. Assume that the domain sizes and the domain 
means of an auxiliary variable, X, are available. Consider 
the three regression estimators of the domain mean, 


and 


where 
$n_i$ is the number of observations in domain $i$, $N_i$ is the
population size of domain $i$, $\mu_{xi}$ is the population mean of
$X$ for domain $i$, and $\mu_x$ is the grand population mean of
$X$. In the authors’ terminology, the first estimator is a
direct regression estimator, the second is a modified direct
estimator, and the third is a synthetic estimator. We have

The estimator $\hat{\mu}_{(1)yi}$ uses only information in the sample
of $n_i$ observations. Hence, all properties of the estimator
are functions of $n_i$ and of the domain parameters. The
regression bias is order $n_i^{-1}$ and the variance is order $n_i^{-1}$.
The estimator $\hat{\mu}_{(2)yi}$ uses the domain means, but the entire
sample to estimate the regression coefficient. Hence, the
basic variance remains order $n_i^{-1}$ and will be larger than
the basic variance of $\hat{\mu}_{(1)yi}$ in those situations where $\beta_i \ne \beta$.
However, the second order contribution to the variance
is order $n_i^{-1} n^{-1}$ for $\hat{\mu}_{(2)yi}$ and is order $n_i^{-2}$ for $\hat{\mu}_{(1)yi}$.
Also, the regression bias for $\hat{\mu}_{(2)yi}$ is order $n^{-1}$. If the
domains were strata, $\hat{\mu}_{(1)yi}$ might be called the separate
regression estimator and $\hat{\mu}_{(2)yi}$ might be called the com-
bined regression estimator.

The estimator $\hat{\mu}_{(3)yi}$ is a synthetic estimator and has a
variance of order $n^{-1}$ instead of the order $n_i^{-1}$ variance
of the first two estimators. The cost of this reduction in 
variance is that the bias is order one. Only if the regression 
line is the same for the domain as for the entire population 
will the bias be zero. 

The average mean square error of the three estimators 
for any subset of small areas can be estimated. If the $n_i$
are small, the estimated variances will provide only limited 
information for discriminating among estimators. Like- 
wise, there is only one degree of freedom for bias squared 
for one particular domain. However, a large domain 
deviation, relative to the standard error, will lead one to 
reconsider the synthetic estimator. 

In their discussion of models, the authors stress the 
importance of providing estimators of the reliability for 
small area estimators. They allude to the fact that the prin- 
cipal estimators of mean square error for model based 
procedures are estimators of an average mean square 
error. While this is true, it seems worth mentioning that 
components-of-variance procedures do not assume the 
mean square errors to be the same in each domain. Also, 
for the typical survey situation, the estimators of mean 
square error need not be constant over domains. For 
example, one of the terms in the mean square error esti- 
mator of the components of variance procedure is the esti- 
mator of the variance of the direct estimator. The estimated 
variance of the direct estimator will be a function of the 
domain sample size and can also be a function of the direct 
estimated variance of the direct estimator for that domain. 
See Battese, Harter, and Fuller (1988), Harville (1976), 
Prasad and Rao (1990), and Ghosh and Rao (1993). 


In their discussion of designs, the authors explain that 
the variance function is often relatively flat in the vicinity 
of the optimum allocation to strata. A slight reallocation 
of sample among strata can markedly increase the effi- 
ciency of domain estimators for a relatively small decrease 
in the efficiency of the overall estimates. The same is true 
with respect to the combination of direct and synthetic 
estimators. Thus, if one has a relatively good idea of the 
variance component associated with small areas, either 
from a previous study on the same population or from a 
study on a similar population, and if one is under pressure 
to produce estimates in a brief time span, then it is reason- 
able to assign fixed weights to form the linear combina- 
tion. The loss in efficiency is apt to be modest and the 
programming required for estimation construction consid- 
erably reduced. One estimator in this class, and the one 
adopted by many practitioners, is the synthetic estimator. 

The authors briefly raise the question of internal con- 
sistency associated with the construction of small area 
estimates. As they say, if one uses a data dependent pro- 
cedure, such as variance components, for each dependent 
variable, then one produces estimates that are not inter- 
nally consistent. One option is to use multivariate pro- 
cedures. See, for example, Fuller and Harter (1987) and 
Fay (1987). Another procedure suggested by Fuller (1990) 
is to construct components of variance estimators for a 
limited subset of variables and then use these estimates as 
control variables in a regression procedure. The regression 
procedure produces weights for the individual observa- 
tions. Once the weights are constructed, any number of 
output tables can be constructed and all estimates are inter- 
nally consistent. 

It is my observation that the gains made in most prac- 
tical domain estimation problems come primarily from the 
wise use of auxiliary information. Thus, effort directed
towards obtaining quality auxiliary information is effort 
well spent. If we are able to find a variable x that is highly 
correlated with the variable y, then there is less variability 
remaining to be allocated between area to area variance 
and sampling variance.

COMMENT


GRAHAM KALTON¹


As Singh, Gambino and Mantel (SGM) indicate, there 
is a growing demand for surveys to provide domain esti- 
mates for domains of various sizes and types. This demand 
is being experienced in many countries throughout the 
world. In part it may simply reflect a natural growth in 
the sophistication of survey analysts, who once were 
content with national estimates and estimates for a few 
major domains, but who now want to compare and con- 
trast estimates for many different types of domain. In part 
it results from the needs of policy makers, who require 
domain information in order to examine how current 
policies affect different domains, to predict what effects 
changes in policies might have, and for policy implementation.
Information on administrative area domains (e.g.,
provinces or states, counties, and school districts) is of 
particular interest for policy purposes (e.g., for identifying 
low income areas for government support). 

In some circumstances the need for domain estimates 
of adequate precision can be satisfied within the design- 
based inference framework that is standardly used in the 
analysis of survey data. This holds for large domains for 
which the sample sizes are adequate to give the precision 
required. It can also hold for small domains provided that 
they are identified in advance, and the sample design is con- 
structed in a way that provides adequate sample sizes. Thus, 
for example, in the United States, the National Health and 
Nutrition Examination Survey and the Continuing Survey 
of Food Intakes by Individuals use differential sampling 
fractions by age, sex and race/ethnicity and by age/sex and 
low income status, respectively, in order to provide adequate 
samples for the domains created by the cross-classifications 
of these variables. The U.S. Current Population Survey 
employs differential sampling fractions across the states 
in order to be able to produce state-level employment 
estimates. The limitation of this approach is evident when 
there is a large number of small domains, in which case 
the sum of the required sample sizes for each domain pro- 
duces an extremely large overall sample size. This situation 
occurs often with small administrative districts, such as 
counties, school districts, and local employment exchanges. 
In such cases, it may be necessary to discard the standard 
design-based inference approach in favor of a model- 
dependent approach that employs a statistical model in the 
estimation process to borrow strength from data other than 
that collected in the survey for the given small area. The 
model-dependent approach may also be required for 
unplanned small domains, where the need for oversampling 
had not been foreseen at the design stage. 


In response to the demand for small area estimates, a 
sizeable literature has developed on model-dependent 
small area estimation methods. Little has, however, been 
written on the broader issues of small area estimation 
discussed in the SGM paper, issues that need more atten- 
tion. Like the authors, I believe that a cautious approach 
should be adopted to the use of model-dependent small 
area estimators. I therefore welcome their discussion of 
methods to make small area estimates within the design- 
based framework. 

From my perspective, the first approach to making 
small area estimates is to see whether estimates can be 
produced with adequate precision within the design-based 
framework. If the domains have been identified in advance, 
consideration should be given to designing the sample to 
meet the needs for small area estimates. This may involve 
ensuring that the small areas do not overlap strata, and 
ensuring a sufficient sample size for each small area. 
Another approach suggested by SGM is to minimize the 
amount of clustering. The smaller the amount of clustering, 
the less the sample size in each small area is subject to the 
vagaries of chance. In this regard I see the benefits of less 
clustering as mainly directed at providing the ability to 
produce estimates for small areas that were not identified 
at the design stage. When small areas for which estimates 
are planned are made into separate strata, the sample size 
in each small area should be under adequate control even
with a clustered sample (provided that the measures of size 
used in the PPES sampling are reasonable). However, even 
with planned estimates, there will often be an issue of how 
to compute variance estimates for a small area from a 
clustered design, since the number of PSUs sampled in 
each small area is likely to be small. A variance estimate 
based on the PSUs within the small area will then be 
imprecise, with few degrees of freedom, and a generalized 
variance function approach may be preferred (e.g., 
assuming that the national design effect applies for each 
small area). In other words, although the estimate itself 
may be a design-based estimate, the estimate of its variance 
may be an indirect one, borrowing strength from other 
areas. This consideration favors as unclustered a design 
as possible even for planned small area estimates. The need 
to model variances is, however, of lesser concern than the 
need to model the estimates themselves. 
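
As a rough illustration of the generalized variance function idea mentioned above, the sketch below applies a national design effect to the simple random sampling variance of an estimated proportion. The function name and the choice of a proportion as the statistic are assumptions made only for this example.

```python
def gvf_variance(p_hat, n_area, national_deff):
    """Indirect variance estimate for a small area proportion.

    Multiplies the simple random sampling variance p(1 - p)/n by a
    design effect borrowed from the national sample (an assumption
    that the national design effect applies in the small area).
    """
    if n_area <= 0:
        raise ValueError("n_area must be positive")
    srs_variance = p_hat * (1.0 - p_hat) / n_area
    return national_deff * srs_variance

# Hypothetical small area: 100 sampled persons, estimated proportion 0.5,
# national design effect of 2.
variance = gvf_variance(0.5, 100, 2.0)
```

The point estimate remains design-based; only its variance estimate borrows strength from other areas.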

An integral part of the design-based framework is a 
recognition that auxiliary information available for the 
population may be used at the design stage, at the analysis 
stage, or at both stages. When information on auxiliary
variables that are closely related to the survey variable is
available, substantial gains in precision can accrue. The
use of auxiliary information at the analysis stage, through
such techniques as post-stratification and ratio, regression
and difference estimation, has a special appeal for small
area estimation. It should be emphasized that ratio and
regression estimators may be motivated by assumptions
about the model relating the survey variable ($Y$) and the
auxiliary variables ($X$), but that the resultant estimators
are design-consistent irrespective of the appropriateness
of the model. The use of an appropriate model produces
the greatest gains in precision, but the estimates are
approximately unbiased whatever model is chosen. This may be
seen in a simple case where variables $X_1, X_2, \ldots, X_p$ are
known for every element in the population, and the linear
combination $\hat{Y}_i = B_0 + B_1 X_{1i} + \ldots + B_p X_{pi}$ is used to
estimate $Y_i$, the value of the $Y$-variable for population
element $i$. Assume, for simplicity, that the B’s are
determined from external data, not dependent on the sample.
With $Y_i = \hat{Y}_i + e_i$, the domain total is
$Y_a = \sum_{i \in a} \hat{Y}_i + \sum_{i \in a} e_i = \hat{Y}_a + E_a$.
Since $\hat{Y}_a$ is known, the estimation problem
is one of estimating $E_a$. From a sample of elements
in domain $a$, $E_a$ may be estimated by $\hat{E}_a = \sum_{i \in s_a} e_i / \pi_i$,
where $\pi_i$ is the selection probability for element $i$ in the
sample. The estimator $\hat{E}_a$ is unbiased, independent of the
validity of the model employed. The estimation procedure
in fact translates the estimation problem from one of
estimating $Y_a$ directly to one of estimating $E_a$ and adding on
a known constant $\hat{Y}_a$. To be effective, the procedure
requires the domain variance of the $e_i$ to be smaller than
that of the $Y_i$. There is no requirement that $E_a = 0$. The
general logic remains the same in the more usual situation
where the B’s are estimated from the sample. In this case,
the estimate of $Y_a$ is design-consistent, irrespective of the
model adopted (Särndal 1984). Moreover, the B’s may be
estimated from the sample data only for the domain of 
interest, producing what SGM term a direct estimator, or 
from the total sample, producing a modified direct esti- 
mator. A key consideration in the choice between the 
direct and modified direct estimators in this case is whether 
the overall B’s also apply for the domain. If not, inter- 
action terms between the X’s and the domain indicators 
are called for in the total sample model. With a full set of 
these interaction terms, the modified direct estimator in
effect then reduces to the direct estimator.
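
The unbiasedness of the residual-based estimator, whatever the quality of the predictions, can be checked by brute force on a tiny example. The sketch below assumes simple random sampling without replacement from a domain of four elements, treats the whole domain as the sampling frame for simplicity, and uses deliberately poor predictions; the data and names are hypothetical.

```python
from itertools import combinations

import numpy as np

def difference_estimate(y_hat_pop, e_sample, pi_sample):
    """Known domain total of the predictions plus the weighted sum of sampled residuals."""
    return y_hat_pop.sum() + np.sum(e_sample / pi_sample)

# Hypothetical domain of 4 elements; the "model" predicts 8 for everyone.
y = np.array([10.0, 12.0, 15.0, 23.0])
y_hat = np.array([8.0, 8.0, 8.0, 8.0])
e = y - y_hat                      # residuals, whose total is far from zero
n, N = 2, len(y)
pi = np.full(N, n / N)             # SRS without replacement: selection probability n/N

# Enumerate every possible sample to obtain the exact design expectation.
estimates = [
    difference_estimate(y_hat, e[list(s)], pi[list(s)])
    for s in combinations(range(N), n)
]
expected_value = float(np.mean(estimates))
```

Averaged over all six possible samples, the estimate equals the true domain total of 60 even though the residual total is not zero. A better model would only reduce the spread of the individual estimates.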

The need for a model-dependent approach occurs when
the design-based estimate lacks sufficient precision even
after the auxiliary data available have been used in as
effective a manner as possible. Indeed, in some cases the
computation of a direct estimate may be impossible because
there are no sample cases in the small area. In such
situations, it becomes necessary to use a statistical model to
borrow strength from other data, often data from other
areas. Such models are built upon assumptions (e.g.,
$E_a = 0$ in the above example), and the quality of the
resultant small area estimates depends on the suitability
of the assumptions made. The assumptions are inevitably
incorrect to some degree, leading to biases in the small area
estimates. Since indirect estimates are biased, the
design-based mean square error (MSE) is widely used as the
measure of their quality, where $\mathrm{MSE} = V' + B^2$ and
$V'$ is the variance and $B$ is the bias of the estimate.

The common way to compare the quality of a direct and 
an indirect estimate is to compare the variance, V, of the 
former with the MSE of the latter. However, reading the 
paper caused me to question whether the MSE is the 
appropriate measure of quality of an indirect estimator. 
In a practical setting the variance V of the direct estimate 
can be estimated whereas the design-based MSE of the 
indirect estimate cannot. In view of this situation, if 
V = MSE, then the direct estimator would be clearly 
preferred. In fact, the direct estimator may tend to be 
preferred if the direct estimator has adequate precision, 
irrespective of the likely relative magnitudes of V and 
MSE. In other cases, if B is the expected bias, then the 
direct estimator may be preferred to the indirect estimator 
unless $V > V' + kB^2$, where $k$ is a multiplier greater
than 1 that allows for the fact that the unknown bias may 
be larger than expected. 

The same argument can be applied to combined (or 
composite) estimators that employ a weighted average of 
a direct and an indirect estimator. Often the principle for 
choosing the weights is taken to be to minimize the mean 
square error of the combined estimator, leading to weights 
for the direct and indirect estimators that are inversely 
proportional to V and MSE, respectively. However, 
following the above argument, an alternative procedure 
would be to minimize the weight of the indirect estimator, 
subject to the condition that the combined estimator is 
sufficiently accurate. Alternatively, the weights could be 
determined on some maximum likely value of the MSE, 
rather than the expected MSE, to reduce the risk of serious 
bias in the combined estimator. 
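
The two weighting principles can be put side by side in a short sketch. It assumes that the direct and indirect estimators are uncorrelated, so that weights inversely proportional to V and MSE minimize the mean square error of the combination. The cautious variant, which inflates the squared bias by a multiplier k, is one possible reading of the argument above rather than a prescribed formula; all names are illustrative.

```python
def mse_optimal_weight(v_direct, mse_indirect):
    """Weight on the direct estimator when weights are inversely
    proportional to V (direct) and MSE (indirect)."""
    return mse_indirect / (v_direct + mse_indirect)

def cautious_weight(v_direct, v_indirect, bias, k=2.0):
    """Same rule, but with the squared bias of the indirect estimator
    inflated by a multiplier k > 1 (an assumed safeguard)."""
    return mse_optimal_weight(v_direct, v_indirect + k * bias ** 2)

def composite(direct, indirect, w_direct):
    """Weighted average of a direct and an indirect estimate."""
    return w_direct * direct + (1.0 - w_direct) * indirect

# Hypothetical values: V = 4 for the direct estimator; V' = 1 and B = 1 for the indirect one.
w_expected = mse_optimal_weight(4.0, 1.0 + 1.0 ** 2)
w_cautious = cautious_weight(4.0, 1.0, 1.0, k=3.0)
estimate = composite(10.0, 20.0, w_cautious)
```

Inflating the bias term moves weight towards the direct estimator, which is the direction of caution argued for above.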

I do not follow the rationale for the sample size dependent
estimators described by SGM in equations (6.11) and
(6.12) in general, but under certain assumptions they may
be seen to fit into the logic given above. With an equal
probability sample design and $\delta = 1$, these estimators
reduce to the direct estimator when the achieved sample 
size is greater than, or equal to, the expected sample size. 
If one assumes that the expected sample size gives adequate 
precision for the small area, this outcome accords with the 
above reasoning. If the achieved sample size is smaller than 
expected, the sample size dependent estimator takes a 
weighted average of a direct and an indirect estimator. If 
one assumes that the expected sample size is the minimum 
sample size to give the required precision, this outcome 
also accords with the above reasoning. If this indeed is the 
basis of the sample size dependent estimators, then it 
would seem useful to generalize them to situations where
the expected sample size is not the sample size that just
gives the level of precision required.
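
The behaviour described in this paragraph can be sketched as a weight on the direct estimator. Equations (6.11) and (6.12) are not reproduced here, so the form below is an assumption: under an equal probability design the weight is taken as the ratio of the achieved to the expected sample size, scaled by delta and capped at one. The function and parameter names are hypothetical.

```python
def ssd_weight(n_achieved, n_expected, delta=1.0):
    """Weight given to the direct estimator by a sample size dependent rule.

    Assumed form for an equal probability design: the weight is 1 when
    n_achieved >= delta * n_expected, and n_achieved / (delta * n_expected)
    otherwise. The remaining weight goes to the indirect estimator.
    """
    if n_expected <= 0 or delta <= 0:
        raise ValueError("n_expected and delta must be positive")
    return min(1.0, n_achieved / (delta * n_expected))

weight_large_sample = ssd_weight(12, 10)   # achieved exceeds expected
weight_small_sample = ssd_weight(5, 10)    # achieved is half of expected
```

Under this reading, the generalization suggested above would amount to passing the sample size that just gives the required precision in place of the expected sample size.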

As has been noted, auxiliary information plays an 
important role in the production of accurate small area 
estimates. Such information may be used for improving 
the precision of design-based estimates or it may be used 
in the models employed with the model-dependent approach. 
Ideally auxiliary information that is highly related to the 
survey variables involved in the estimates is required. The 
regular compilation of up-to-date auxiliary data for small 
areas from administrative and other sources can provide 
a valuable resource for a small area statistics program. 

Although the paper mentions the more general problem 
of small domains, it focuses predominantly on small areas. 
This is in line with the general literature and the application 
of indirect estimation procedures. In part, this may be 
because the number of socio-economic and other small 
domains of interest (e.g., age/sex domains) is usually 
relatively small, compared with the numbers of small 
areas, so that socio-economic domains can be handled by 
designing the sample to provide direct estimates of adequate 
precision for each of them. In part, it may be because the 
definitions of socio-economic and demographic domains 
are often chosen in the light of the feasibility of producing 
design-based estimates of adequate precision for them 
(e.g., using wider age groupings for some domains); in the
case of areal domains, however, the areas are predefined, 
and no collapsing of areas is acceptable. In part, it may 
be because there is a lack of auxiliary data to use in the 
statistical models for such domains. In part, it may also 
be because the analysis of socio-economic domains is often 
conducted to make comparisons between the domains. 
Such comparisons are distorted when the estimate for one
domain borrows strength from other domains (see, for
example, Schaible 1992). This issue brings out the general 
point that indirect estimates should not be uncritically used 
for all purposes. 

In conclusion, I should like to express my support for 
the general approach of this paper. Where possible, samples 
should be designed to produce direct small area estimates 
of adequate precision, and sample designs should be 
fashioned with this in mind. Auxiliary data should be used, 
where possible, to improve the precision of direct small 
area estimates. When indirect estimates are called for, a 
cautious approach should be used. Models should be 
developed carefully, estimators that are robust to failures 
in the model assumptions should be sought, and evaluation 
studies should be conducted to assess the adequacy of the 
indirect estimates. Lacking good measures of quality for 
individual indirect estimates, such estimates need to be 
clearly distinguished from design-based estimators. Since 
indirect estimates are not universally valid for all purposes, 
users need to carefully assess whether the given form of 
indirect estimate will satisfy their particular needs.

RESPONSE FROM THE AUTHORS


We would like to thank Wayne Fuller and Graham 
Kalton for their stimulating comments, which we find to 
be quite complementary to the position developed in our 
paper. In many cases their comments make certain points 
clearer and strengthen the arguments presented. Encouraged
by this kind of endorsement, we would like to carry some
of the points about survey design further, while responding 
to the main points made by the discussants. 

There is no doubt that survey designers try to optimize 
the design under operational constraints to meet the stated 
objectives of a survey. There are usually several objectives 
to be met by major surveys and it is quite likely that 
designers have limited influence in the setting of priorities 
among the various competing objectives. Nevertheless, it 
is at this stage of priority setting that the case for small area 
needs should be made strongly, particularly for major 
continuing surveys. 

During the sixties and seventies emphasis in most countries 
was placed on sub-national (state/provincial) estimates and 
certain compromises were made to the earlier designs that 
optimized national estimates. For example, different 
sampling fractions were used to ensure a minimum sample 
size for smaller states/provinces. With the demands for 
data at the sub-state/province level, such as county, district
and municipality, more compromises to the national 
optimum allocation become necessary, requiring differing 
sampling fractions among the administrative areas within 
states/provinces. For example, if the aim is to produce sub- 
provincial estimates of comparable quality, then provinces 
will likely receive sample roughly proportional to the 
number of subprovincial regions they contain. Such an allo- 
cation may not be the same as one using the relative popula- 
tion sizes of the provinces. As we discussed in section 5.4, 
the allocation approach should put more emphasis on a 
bottom-up strategy. Losses at higher levels and gains at 
lower levels would differ from survey to survey but it is 
likely that in many cases a minor loss in CV at the national 
level will lead to appreciable gains at small area levels. 
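
A toy comparison of the two allocation rules may help fix ideas. The sketch below splits a total sample between two hypothetical provinces, first in proportion to population and then in proportion to the number of subprovincial regions each contains; the figures are invented for illustration.

```python
def allocate(total_n, sizes):
    """Split total_n in proportion to the given sizes (no rounding)."""
    total_size = float(sum(sizes))
    return [total_n * size / total_size for size in sizes]

populations = [9_000_000, 1_000_000]   # hypothetical province populations
regions = [10, 10]                     # hypothetical numbers of subprovincial regions

top_down = allocate(1000, populations)   # proportional to population: [900.0, 100.0]
bottom_up = allocate(1000, regions)      # proportional to regions:    [500.0, 500.0]

# Average sample per subprovincial region under each rule.
per_region_top_down = [n / r for n, r in zip(top_down, regions)]
per_region_bottom_up = [n / r for n, r in zip(bottom_up, regions)]
```

With the population-based rule the regions of the smaller province receive 10 units each against 90 in the larger province, whereas the region-based rule gives 50 units to every region.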

Kalton stresses the importance of reduced clustering 
for variance estimation; it is advantageous to increase the 
degrees of freedom by having a large number of smaller 
clusters rather than a small number of larger clusters. We 
would like to emphasize that clustering has another draw- 
back for estimation, and especially small area estimation, 
namely, a highly clustered design will lead to high design 
effects, even for planned small domains. The usual reason 
for resorting to clustered designs is to reduce survey costs. 
In light of the changes that continue to occur in the data 
collection process, such as decreased reliance on at-home 
interviews and increased use of computer assisted inter- 
viewing, a periodic review of the cost-variance models that 
underlie clustering decisions is necessary. 


One other issue not addressed in our paper is the impact 
of sample rotation in continuous surveys. For a given time 
point, there may be insufficient sample in some small 
domains to produce reliable estimates. But, as units rotate 
out of the sample and are replaced, the accumulated or 
effective sample in the domains increases and may allow 
the computation of reliable, albeit time-biased, domain 
estimates. By judicious choice of rotation schemes, survey 
designers can maximize the cumulative sample size over 
some time period. For example, for quarterly estimates in 
a monthly survey, the optimal rotation pattern is $[1(2)]^k$,
i.e., repeat the sequence "one month in sample, two months
out" $k$ times. This thinking is in the same spirit as Leslie
Kish’s ideas on cumulation of samples over time. 
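
The effect of the one-in, two-out pattern on the cumulated quarterly sample can be sketched with rotation groups. The code assumes three rotation groups whose cycles are staggered by one month, so that exactly one group is in sample each month; the function names and the staggering rule are assumptions made for the example.

```python
def groups_in_sample(month, months_in, months_out, n_groups):
    """Rotation groups in sample in a given month.

    Each group follows a cycle of months_in months in sample followed by
    months_out months out; group g starts its cycle in month g.
    """
    cycle = months_in + months_out
    return {g for g in range(n_groups) if (month - g) % cycle < months_in}

def distinct_groups_over(months, months_in, months_out, n_groups):
    """Number of different rotation groups observed over the given months."""
    seen = set()
    for month in months:
        seen |= groups_in_sample(month, months_in, months_out, n_groups)
    return len(seen)

# One month in, two months out, three groups: a different group every month.
quarter_with_rotation = distinct_groups_over(range(3), 1, 2, 3)
# Same monthly sample size but no rotation: the same group every month.
quarter_without_rotation = distinct_groups_over(range(3), 1, 0, 1)
```

With the same monthly workload, the rotating design observes three distinct groups in a quarter and the fixed design only one, which is the gain in cumulated sample referred to above.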

Kalton clarifies and elaborates the cautious approach to 
the use of indirect estimators by suggesting a weighted mean 
squared error, which attaches a weight greater than 1 to the
bias term, to allow for the fact that the bias of the indirect 
estimator may be larger than expected. There are two 
distinct reasons why the bias may be larger than what is 
expected from the model for small area effects: random 
variation within the model, and model breakdown. It is 
worth recalling here the suggestion of Fay and Herriot 
(1979) to constrain a combined estimate to be within one 
standard error of a design estimate; this approach makes 
allowance for the possibility of large bias in the model 
estimator for whatever reason. Kalton also reiterates our 
position that if a direct estimator is of acceptable quality, 
then in practice, one may decide to use this direct estimator 
even though its estimated mean squared error exceeds that 
of model-based competitors. Because there is always the 
possibility of model failure lurking in the background, this 
"better safe than sorry" approach is desirable, at least until
some experience with particular indirect estimators in 
specific situations has been gained. This does not contradict 
the view that there arise situations in which it is necessary 
to throw caution to the wind. 
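
The constraint recalled above, keeping a combined estimate within one standard error of the design estimate, amounts to a simple truncation. The sketch below is a minimal illustration with hypothetical names, not a reproduction of the original procedure.

```python
def constrain_to_direct(combined, direct, se_direct):
    """Pull a combined estimate back to within one standard error
    of the direct (design-based) estimate."""
    if se_direct < 0:
        raise ValueError("se_direct must be non-negative")
    lower = direct - se_direct
    upper = direct + se_direct
    return min(max(combined, lower), upper)

# Hypothetical direct estimate of 10 with a standard error of 2.
pulled_down = constrain_to_direct(15.0, 10.0, 2.0)   # above the band
unchanged = constrain_to_direct(9.0, 10.0, 2.0)      # inside the band
pulled_up = constrain_to_direct(5.0, 10.0, 2.0)      # below the band
```

Whatever the source of a large bias in the model estimator, the published figure then cannot stray far from what the survey data alone would support.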

In his remarks on the sample size dependent estimator, 
Kalton’s comments imply that there is a risk in the strategy 
which gives the synthetic component zero weight if the 
observed sample size in the small domain exceeds the 
expected sample size there since the latter may be too small 
to yield adequate direct estimates. One option is to use a 
value $n_{\min}$, which is the size that produces direct estimates
that are just barely acceptable. Note, however, that $n_{\min}$ as
defined here is characteristic-dependent.

In his comments, Fuller briefly describes an approach 
to small area estimation that takes advantage of a variance 
components model and yet has fixed weights for internal 
consistency among estimators for different characteristics. 
Besides internal consistency of small area estimates for 
different characteristics, a second type of consistency that
is sometimes required is that estimates of totals for the set
of small areas within a larger area should add up to the 
published direct estimate for the larger area. One way to 
achieve this is to benchmark the small area estimates to the 
direct estimate for the larger area using, for example, a 
simple ratio adjustment; however, if the ratio adjustment 
factors depend on the characteristic then this would destroy 
the first type of consistency. Both types of consistency 
could be achieved simultaneously if the direct estimators 
for the larger area are generalized regression estimators, 
$\hat{Y} + (X - \hat{X})\hat{\beta}$, and the modified direct (Section 6.1 in
the paper) estimators $\hat{Y}_{greg,a} = \hat{Y}_a + (X_a - \hat{X}_a)\hat{\beta}$ are
used for small areas.
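
The simple ratio adjustment mentioned above can be sketched in a few lines. The figures are hypothetical, and the function illustrates only the benchmarking step for a single characteristic.

```python
def ratio_benchmark(small_area_estimates, direct_total):
    """Scale small area estimates so that they add up to the
    published direct estimate for the larger area."""
    current_total = sum(small_area_estimates)
    if current_total == 0:
        raise ValueError("small area estimates sum to zero")
    factor = direct_total / current_total
    return [factor * estimate for estimate in small_area_estimates]

# Hypothetical small area estimates summing to 60, benchmarked to a direct total of 66.
benchmarked = ratio_benchmark([10.0, 20.0, 30.0], 66.0)
```

Because the adjustment factor is computed separately for each characteristic, weights implied by the adjusted estimates differ from one characteristic to another, which is how the first type of consistency is lost.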

As Fuller notes, the average squared bias of an estimator 
for any subset of small areas can be estimated. Here we 
would like to stress again that the average bias over a set 
of small areas is not directly relevant for any particular 
small area. It is for this reason that we prefer to use, 
whenever possible, estimators that are approximately design 
unbiased. When use of a model estimator is unavoidable, 
serious attempts should be made to find appropriate 
covariates for which reliable auxiliary information is avail- 
able in order to minimize the residual bias of the model 
estimator. 

Perhaps due to the obvious timeliness problems associated 
with census data, neither of the discussants commented on 
censuses as a source of data for smaller domains. In this 
context it is worth mentioning that some form of ongoing 
major post-censal survey replacing or supplementing the
decennial census long-form may be considered. Such a
strategy, called rolling samples, is described by Kish (1990); 
a similar approach, called continuous measurement, is 
described by Alexander (1994). This approach provides a 
number of options which are worth investigating as poten- 
tially cost effective means of producing timely statistics for 
smaller domains. 

Lastly, we would like to stress that the emphasis we put 
on keeping domain estimation in mind at the design stage, 
particularly for medium size domains, in no way under- 
mines the important role of models in estimating for very 
small domains. 

We hope that the general direction of the strategy pro- 
posed in the paper, supplemented by the fine points brought 
out by the discussants, particularly the support and cautions 
summarized by Kalton in his concluding paragraph, will be 
helpful to survey designers and researchers in finding 
solutions appropriate to the particular problems they are 
dealing with.

Small Domain Estimation for Unequal Probability
Survey Designs


D. HOLT and D.J. HOLMES¹


ABSTRACT 


The problem of estimating domain totals and means from sample survey data is common. When the domain is large, 
the observed sample is generally large enough that direct, design-based estimators are sufficiently accurate. But 
when the domain is small, the observed sample size is small and direct estimators are inadequate. Small area estimation 
is a particular case in point and alternative methods such as synthetic estimation or model-based estimators have 
been developed. The two usual facets of such methods are that information is ‘borrowed’ from other small domains 
(or areas) so as to obtain more precise estimators of certain parameters and these are then combined with auxiliary 
information, such as population means or totals, from each small area in turn to obtain a more precise estimate 
of the domain (or area) mean or total. This paper describes a case involving unequal probability sampling in which 
no auxiliary population means or totals are available and borrowing strength from other domains is not allowed 
and yet simple model-based estimators are developed which appear to offer substantial efficiency gains. The approach 
is motivated by an application to market research but the methods are more widely applicable. 


KEY WORDS: Synthetic estimation; Design-based estimation; Small area estimation; Model-based estimation;
Market shares.


1. INTRODUCTION 


This paper is concerned with the common problem of 
estimating domain totals and means from a disproportion- 
ately allocated sample survey. Some domains may be large, 
in which case the achieved sample size may be large too 
and design-based (or direct) estimators will be satisfactory. 
Some domains may be small, in which case the achieved 
sample size may be small too and design-based (or direct) 
estimators will be too imprecise for practical use. The 
methods proposed will be motivated through the example 
of estimating sales, market shares and market penetrations 
for products in a market research survey. The domains are 
particular auto manufacturers or models. However, the 
general approach is applicable to other disproportionately 
allocated surveys of businesses or institutions. 

The problem is analogous to that of using synthetic 
estimation for small area estimation (Gonzales 1973; 
Gonzales and Hoza 1978; Platek et al. 1987). Synthetic 
estimation usually depends on two factors: (i) the use of 
auxiliary variables in conjunction with population means 
or totals for each small area (or domain) to improve 
estimates through poststratification or regression estima- 
tion, and (ii) the improvement of estimates by pooling 
data across the small areas (or domains). In our situation 
no auxiliary population means or totals are available 
and, since the essential objective is to compare domains 
(i.e., manufacturers and particular products), the idea of 
borrowing strength between these is inadmissible. A class
of synthetic estimators is proposed which uses neither of
these two approaches and yet is preferred to the direct 
survey estimators. The proposed estimators have a simple 
structure, an interesting interpretation and can be justified 
under a set of model assumptions which are testable under 
the general assumption of non-informative survey design. 


2. THE MARKET RESEARCH EXAMPLE 


Market researchers often estimate the total volume of 
sales and market shares for each manufacturer of a partic- 
ular product. We consider the case of autos purchased for 
company fleet use in a single year. Estimates of totals and 
market shares are required for each auto manufacturer and 
for specific models which are widely purchased for fleet use. 

The terms ‘fleet’ and ‘company’ are each interpreted 
widely. A fleet car is taken to mean any auto purchased 
on a commercial as opposed to a private basis, and used 
in conjunction with a business in the broadest sense. This 
includes autos purchased for sales representatives which 
may be purchased in large numbers. It also includes single 
purchases of luxury cars for company directors and other 
senior staff of large companies, as well as purchases by 
small ‘companies’ such as groups of doctors, or self- 
employed people such as shop owners. Thus the population 
of purchasing companies - termed consumers - includes 
a large number of small companies that purchase only one 
or two autos every few years.

In the reference period of one year we define $Y_{ki}$ to be
the number of autos of product type $k$ purchased by
consumer $i$. The product type $k$ (the domain) may refer to a
specific model of a particular manufacturer, or to all
models produced by a manufacturer. Thus, $Y_k = \sum_i Y_{ki}$
is the total number of autos of type $k$ purchased by all
consumers. Let $Z_i$ be the total number of autos of any kind
purchased by consumer $i$, and $Z = \sum_i Z_i$ be the total
number of auto sales. The market share for product type
$k$ is defined as $R_k = Y_k/Z$.

We further define

$$Y'_{ki} = \begin{cases} 1 & \text{if } Y_{ki} > 0 \\ 0 & \text{if } Y_{ki} = 0 \end{cases}$$

and

$$Z'_i = \begin{cases} 1 & \text{if } Z_i > 0 \\ 0 & \text{if } Z_i = 0. \end{cases}$$

Thus, $Y'_{ki}$ and $Z'_i$ are indicator variables for consumers
who purchase product type $k$ and at least one auto of any
kind, respectively, in the reference period. The number
of consumers that purchase product $k$ is thus given by
$Y'_k = \sum_i Y'_{ki}$, and the total number of consumers
purchasing at least one auto of any kind is given by
$Z' = \sum_i Z'_i$. The market penetration for product $k$, in
terms of the proportion of consumers buying a car of any
type in the reference period who buy type $k$, is given by
$R'_k = Y'_k/Z'$.

The four parameters $Y_k$, $R_k$, $Y'_k$ and $R'_k$ are all
legitimate targets of inference in market research and are
defined as finite population parameters; namely, domain
totals or ratios of domain totals.
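The four finite population parameters can be illustrated with a small sketch. The purchase records below are invented for illustration, and the toy population has only two mutually exclusive product types, which is an assumption made purely to keep the example short.

```python
# Toy data (hypothetical): one dict per consumer i, keyed by product type k.
purchases = [
    {"A": 3, "B": 0},   # consumer 1 bought three autos of type A
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},   # no purchases in the reference year
    {"A": 1, "B": 2},
]
# Z_i: total autos of any kind bought by consumer i (only A and B exist here).
z = [sum(row.values()) for row in purchases]


def finite_population_parameters(k):
    """Return (Y_k, R_k, Y'_k, R'_k) for product type k."""
    y = [row[k] for row in purchases]
    total_autos = sum(y)                          # Y_k
    share = total_autos / sum(z)                  # R_k = Y_k / Z
    buyers = sum(1 for v in y if v > 0)           # Y'_k
    any_buyers = sum(1 for v in z if v > 0)       # Z'
    penetration = buyers / any_buyers             # R'_k = Y'_k / Z'
    return total_autos, share, buyers, penetration


params_A = finite_population_parameters("A")
```

Here product A accounts for 4 of the 7 autos sold, and 2 of the 3 purchasing consumers buy it.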


3. THE SURVEY DESIGN AND DIRECT 
ESTIMATORS 


The survey design was based upon two mutually exclu- 
sive frames and may be regarded as a simple stratified 
design with ten strata. The first frame was a register 
(Dun and Bradstreet) of 35,000 companies, stratified into 
eight strata on the basis of the number of employees and 
whether the company was classified as ‘manufacturing’ or 
‘distributing’. The second frame was a large register of 
1.4 million British Telecom business subscribers, stratified 
into ‘private’ and ‘commercial’ numbers. Note that both 
private and commercial numbers were business subscribers 
but commercial numbers were allocated if separate com- 
mercial premises were occupied. 

Using previous survey data the sample was optimally 
allocated using Neyman allocation to minimize the 
variance of the estimator of the total number of autos pur- 
chased (Z). Data on auto purchases were collected 
immediately after the end of the reference year. The strata
sizes $\{N_h\}$ and sample allocations $\{n_h\}$ for strata
$h = 1, \ldots, 10$ are given in Table 1.


Table 1
Sampling Frame: Sample Size and Weight by Stratum

Stratum (h)              Stratum Size   Sample Size    Weight
                             N_h            n_h        N_h/n_h

British Telecom:
  Private                   389,445         1,150      338.65
  Commercial              1,007,399         7,406      136.02
Dun and Bradstreet:
 Manufacturing
  50-99 employees             6,646           235       28.28
  100-499                     6,826         1,113        6.13
  500-999                       992           520        1.91
  1,000 +                     1,110           849        1.31
 Distributing
  50-99 employees             8,703           472       18.44
  100-499                     7,625         1,437        5.31
  500-999                     1,133           484        2.34
  1,000 +                     1,523         1,117        1.36
Overall                   1,431,402        14,783       96.83
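As a reminder of what Neyman allocation does, the following minimal sketch allocates a fixed total sample in proportion to $N_h S_h$, where $S_h$ is the within-stratum standard deviation of the study variable. The function name and the numerical inputs are hypothetical; the standard deviations actually used in the survey are not given in the paper.

```python
def neyman_allocation(N, S, n_total):
    """Allocate n_total so that n_h is proportional to N_h * S_h."""
    products = [N_h * S_h for N_h, S_h in zip(N, S)]
    total = sum(products)
    return [n_total * p / total for p in products]


# Hypothetical inputs: a large homogeneous stratum and a small variable one.
N = [1000, 100]
S = [1.0, 10.0]
alloc = neyman_allocation(N, S, 200)
```

The small but highly variable stratum receives the same sample size as the large one, which is the kind of disproportionate allocation that produces widely varying weights. In practice the result would also be rounded and capped at $N_h$.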


The sample is a simple, disproportionately allocated 
stratified design and the direct estimators and their vari- 
ances are well known. The stratification results in large 
differences in sampling weights (1.31 to 338.65) and is 
useful but far from ideal. Many consumers do not pur- 
chase any autos at all in the reference year so that each 
stratum contains a mixture of zero and non-zero responses. 
For any particular product k the proportion of zero 
responses in each stratum is obviously larger. 

Table 2 contains the direct survey estimates, estimated 
standard errors (see Holt and Holmes (1993) for derivation), 
and coefficients of variation for a selection of products 
from different auto manufacturers. Products A and B 
represent all models for two major auto manufacturers. 
Product C is a single model with a substantial share of the 
fleet market from manufacturer A. The remaining products 
have small market shares. Products F and G cater for the 
executive part of the fleet market. The list is incomplete 
so that the market shares do not sum to one. Also note that 
the product categories are not mutually exclusive. In 
general the survey was judged to perform satisfactorily but 
it was observed over a period of years that estimates for 
manufacturers or models with small market shares were 
unstable. This is best seen in terms of the coefficient of 
variation which is greater than 0.1 for products with small 
market shares and can be greater than 0.15 or 0.2 in some 
cases. This instability also affects the estimates of variance 
as well as the estimates of total sales or market shares of 
the products.


Table 2
Direct Survey Estimates, Standard Errors and Coefficients
of Variation for Selected Products

              Estimating Consumers        Estimating Autos
Product (k)   Total      Penetration      Total       Share
              Y'_k       R'_k             Y_k         R_k

A             59,890     .3843            270,051     .3781
              (2,651)    (.0144)          (35,704)    (.0315)
              (.044)     (.037)           (.132)      (.083)

B             34,282     .2200            153,518     .2149
              (1,960)    (.0117)          (8,653)     (.0131)
              (.057)     (.053)           (.056)      (.061)

C             23,363     .1499            81,381      .1139
              (1,602)    (.0098)          (17,559)    (.0194)
              (.069)     (.065)           (.216)      (.170)

D             13,857     .0889            25,312      .0354
              (1,311)    (.0081)          (2,906)     (.0039)
              (.095)     (.091)           (.115)      (.110)

E             9,025      .0579            24,370      .0341
              (1,146)    (.0072)          (7,336)     (.0101)
              (.127)     (.124)           (.301)      (.296)

F             5,125      .0329            13,724      .0192
              (676)      (.0043)          (2,369)     (.0030)
              (.132)     (.131)           (.173)      (.156)

G             7,518      .0482            11,031      .0154
              (1,015)    (.0064)          (1,456)     (.0022)
              (.135)     (.133)           (.132)      (.143)

Row 1: estimate   Row 2: s.e.   Row 3: c.v.
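The coefficient of variation in the third row of each Table 2 entry is simply the standard error divided by the estimate. A short check for product E, using the figures from the table (the function name is ours):

```python
def coefficient_of_variation(estimate, standard_error):
    """c.v. = standard error divided by the estimate."""
    return standard_error / estimate


# Product E: direct estimates and standard errors taken from Table 2.
cv_consumers = coefficient_of_variation(9025, 1146)    # total consumers
cv_autos = coefficient_of_variation(24370, 7336)       # total autos
```

Rounded to three decimals these agree with the tabulated values of .127 and .301.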


4. A MODEL-BASED APPROACH 


Given the sample design there is no prospect of im- 
proving the efficiency of the direct survey estimators 
within the conventional sample survey framework. The 
usual approaches are through the use of auxiliary infor- 
mation for poststratification, ratio or regression estimation 
but all of these require knowledge of population means 
or totals. No such information is available. We turn instead 
to a model-based approach to provide alternative esti- 
mators for the whole range of products. 


4.1 Estimating $Y'_k$: the Number of Consumers
Purchasing Product Type k


We consider, initially, the number of consumers who
buy product type $k$. We extend the notation from $Y'_{ki}$ to
$Y'_{khi}$ in the obvious way to define the indicator random
variable of purchase for product $k$ for consumer $i$ in
stratum $h$. We treat each consumer's decision as the
outcome of a Bernoulli trial. Let $P_{k|h}$ be the probability
that a consumer in stratum $h$ buys an auto of type $k$
[$P_{k|h} = \text{Prob}(Y'_{khi} = 1)$]. We define the model-based
equivalent of $Y'_k$, the total number of consumers of
product $k$, as

$$\theta'_k = \sum_h N_h P_{k|h}. \tag{1}$$

Assuming that each consumer's decision is independent
the likelihood may be written as the usual product of
binomial terms. The maximum likelihood estimators are given
by $\hat{P}_{k|h} = n_{kh}/n_h$, and the maximum likelihood
estimator of $\theta'_k$ is the familiar stratified sampling
estimator

$$\hat{\theta}'_k(1) = \sum_h N_h \frac{n_{kh}}{n_h} = \sum_h N_h \bar{y}'_{kh}, \tag{2}$$

where $n_{kh}$ is the sample count of consumers in stratum $h$
that buy product $k$, $n_h$ is the stratum sample size and
$\bar{y}'_{kh} = n_{kh}/n_h$ is the sample mean for consumers in
stratum $h$ (i.e., the sample proportion of consumers in
stratum $h$ who buy product $k$). This estimator is generally
unsatisfactory when the sample size for product $k$ is too small.
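The stratified estimator in (2) can be sketched in a few lines. The stratum sizes and counts below are hypothetical and serve only to show the arithmetic.

```python
def direct_estimate(N, n, n_k):
    """Equation (2): sum over strata of N_h * n_kh / n_h."""
    return sum(N_h * n_kh / n_h for N_h, n_h, n_kh in zip(N, n, n_k))


# Hypothetical two-stratum example.
N = [1000, 200]      # stratum sizes N_h
n = [50, 100]        # stratum sample sizes n_h
n_k = [5, 40]        # sampled consumers buying product k, n_kh
theta_1 = direct_estimate(N, n, n_k)
```

Each sampled buyer in the first stratum represents 20 consumers and each in the second represents 2, so a handful of buyers in a heavily weighted stratum can dominate the estimate.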

Suppose we introduce an additional conditioning factor
such that every consumer may be categorized into one of
its categories $f$, $f = 1, \ldots, F$, and further extend the
definition of the indicator random variable to $Y'_{khfi}$.
These categories $f$ will cut across the strata $h$ and the idea
is to define $f$ so that, within any particular category,
whether a consumer buys product type $k$ or not is
independent of the stratum membership $h$. In the case of fleet
purchases we define a categorization based on the total
number of autos owned and operated by each consumer
(i.e., the fleet size). A more detailed discussion of the
choice of $f$ is given in Section 5.

If $N_{hf}$, the population counts of consumers in stratum
$h$ and fleet size category $f$, are known then (1) may be
extended in the obvious way and the target parameter can
now be expressed as

$$\theta'_k = \sum_h \sum_f N_{hf} P_{k|hf}. \tag{3}$$

Equation (3) is the case of poststratification if $\{N_{hf}\}$
are known, and in this case the additional information will
lead to a gain in efficiency (Holt and Smith 1979). When
$\{N_{hf}\}$ are unknown we may rewrite the model in terms of
two sets of probabilities:

$Q_{f|h}$ = Prob{consumer has fleet size $f$ | stratum $h$},

$P_{k|hf}$ = Prob{consumer buys product type $k$ | stratum
$h$ and fleet size $f$}.

The target parameter may now be expressed as

$$\theta'_k = \sum_h \sum_f N_h Q_{f|h} P_{k|hf}. \tag{4}$$

To obtain an alternative model-based estimator we
make further assumptions about the model parameters.
Suppose now that

$$P_{k|hf} = P_{k|f} \quad \text{for all } h. \tag{5}$$


This implies that conditional on the categorization f (the 
size of the fleet operated by a consumer), the probability 
of buying product type k is independent of the original 
stratum membership h. Algebraically, the assumption is 
analogous to that used in synthetic estimation for small 
area estimation but in that case information is pooled 
across areas. That form of the assumption is inadmissible 
in our case. We choose instead pooling across strata within 
the domain of study. The idea is to choose a conditioning 
variable which accounts for the marginal association 
between choice of product and stratum membership. 
Using assumption (5) and with the obvious extension
of the notation ($n_f = \sum_h n_{hf}$, etc.) it may be shown that

$$\hat{Q}_{f|h} = \frac{n_{hf}}{n_h}, \qquad \hat{P}_{k|f} = \frac{n_{kf}}{n_f},$$

and the maximum likelihood estimator of $\theta'_k$ becomes

$$\hat{\theta}'_k(2) = \sum_h \sum_f N_h \frac{n_{hf}}{n_h} \frac{n_{kf}}{n_f} = \sum_f \hat{N}_f \bar{y}'_{kf}, \tag{6}$$

where $\hat{N}_f = \sum_h N_h n_{hf}/n_h$, and $\bar{y}'_{kf} = n_{kf}/n_f$ is the
unweighted sample mean for consumers in category $f$
(i.e. the sample proportion of consumers in category $f$
who buy product $k$).

Thus (6) has the form of a stratified estimator based on
the categorization $f$ but with the population sizes in each
stratum $\{N_f\}$ unknown. Note that an estimator of this
form, but with known $\{N_f\}$, would arise naturally if a
stratified sample based on $f$ had been selected. In fact this
is not so: the sample members of category $f$ are not
selected with equal probability. However, the parameter
assumptions lead to treating the sample in each category $f$
as if it were an equal probability sample since under
assumption (5) the sample weights are uninformative and simply
lead to efficiency loss when estimating $P_{k|f}$. Hence,
although the sampling fractions $n_h/N_h$ are used to estimate
$\{N_f\}$ they are not used explicitly in
$\hat{P}_{k|f} = n_{kf}/n_f = \bar{y}'_{kf}$.
Note that the estimator pools information across strata $h$,
within domain $k$ but not between domains (i.e. products).

Note that if $n_h/N_h$ is constant, equation (6) reduces to
the usual expansion estimator given by (2), and assumption
(5) has not yielded a new estimator. If the sample is
disproportionately allocated the assumption leads to the
use of the sampling weights for $\hat{N}_f$ (where they are needed)
but not for estimating $P_{k|f}$ (where they are uninformative
given $f$ and assumption (5)).
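A minimal sketch of the estimator in (6) follows, together with a check that it coincides with the expansion estimator when the sampling fraction is the same in every stratum. The counts are hypothetical and the function name is ours.

```python
def model_based_estimate(N, n_hf, n_kf):
    """Equation (6): sum over f of N_hat_f * n_kf / n_f.

    N[h]        stratum size N_h
    n_hf[h][f]  sample count in stratum h and category f
    n_kf[f]     sampled consumers in category f who buy product k
    """
    H, F = len(N), len(n_kf)
    n_h = [sum(n_hf[h]) for h in range(H)]
    total = 0.0
    for f in range(F):
        n_f = sum(n_hf[h][f] for h in range(H))
        # Weighted estimate of the number of consumers in category f.
        N_hat_f = sum(N[h] * n_hf[h][f] / n_h[h] for h in range(H))
        # Unweighted sample proportion of buyers within category f.
        total += N_hat_f * n_kf[f] / n_f
    return total


# Hypothetical sample with proportional allocation (n_h / N_h = 0.1).
N = [1000, 500]
n_hf = [[60, 40], [10, 40]]
n_kf = [7, 24]
theta_2 = model_based_estimate(N, n_hf, n_kf)
expansion = (sum(N) / 150) * sum(n_kf)   # (N / n) times the sample count of buyers
```

With equal sampling fractions both quantities equal 310, so any gain from (6) comes entirely from disproportionate allocation.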

Equation (5) is a strong set of assumptions, requiring
$P_{k|hf}$ to be exactly equal to a common value $P_{k|f}$ for all
$h$. In practice, random assumptions such as
$P_{k|hf} = P_{k|f} + \epsilon_{khf}$ may be introduced, where
$E[\epsilon_{khf}] = 0$ and $V[\epsilon_{khf}] = \sigma^2$. These
assumptions will lead to hierarchical
Bayes or empirical Bayes analysis as described in Ghosh
and Rao (1994) or Fay and Herriot (1979). These methods
are not developed here since the simple form of the
model-based estimator would be lost, together with the insight
that this provides. In a similar vein the approach of Särndal
and Hidiroglou (1989) or Drew, Singh and Choudhry
(1982) may be applied to yield sample size dependent
estimators without violating the requirement that no
information is pooled across domains (products).

We can compare the estimators in (2) and (6) when
assumption (5) holds since it may be shown that

$$\begin{aligned}
V_\xi(\hat{\theta}'_k(1)) &= \sum_h \frac{N_h^2}{n_h} P_{k|h}(1 - P_{k|h}) \\
&= \sum_h \sum_f \frac{N_h^2}{n_h} Q_{f|h} P_{k|f}
 - \sum_h \sum_f \sum_{f'} \frac{N_h^2}{n_h} Q_{f|h} Q_{f'|h} P_{k|f} P_{k|f'},
\end{aligned} \tag{7}$$

where the notation $V_\xi(\cdot)$ is used to emphasize that the
variance is evaluated with respect to the model-based
distribution.


It may also be shown that under assumption (5)

$$\begin{aligned}
V_\xi(\hat{\theta}'_k(2)) ={}& \sum_h \sum_f \frac{N_h^2}{n_h} P_{k|f}^2 Q_{f|h}(1 - Q_{f|h})
 - \sum_h \mathop{\sum\sum}_{f \ne f'} \frac{N_h^2}{n_h} P_{k|f} P_{k|f'} Q_{f|h} Q_{f'|h} \\
&+ \sum_h \sum_f \frac{N_h^2 P_{k|f}(1 - P_{k|f}) Q_{f|h}}{n_h \sum_{h'} n_{h'} Q_{f|h'}}
 \left\{ (1 - Q_{f|h}) + n_h Q_{f|h}
 - \frac{1 + (2n_h - 3) Q_{f|h} - 2(n_h - 1) Q_{f|h}^2}{\sum_{h'} n_{h'} Q_{f|h'}} \right\}
\end{aligned} \tag{8}$$

and that $V_\xi(\hat{\theta}'_k(1)) - V_\xi(\hat{\theta}'_k(2)) \ge 0$.

Thus under the additional model assumptions $\hat{\theta}'_k(2)$
has smaller variance as would be expected. These
expressions are model-based variances and no finite population
corrections arise. A predictive approach to the unobserved
elements in each poststratum would give rise to finite
population correction factors.

The maximum likelihood estimator of the market
penetration for product type $k$, $R'_k$, under assumption (5)
is simply given by

$$\hat{R}'_k(2) = \frac{\sum_h \sum_f N_h \dfrac{n_{hf}}{n_h} \dfrac{n_{kf}}{n_f}}{\sum_h \sum_f N_h \dfrac{n_{hf}}{n_h} \dfrac{n_{zf}}{n_f}}
 = \frac{\sum_f \hat{N}_f \bar{y}'_{kf}}{\sum_f \hat{N}_f \bar{z}'_f}, \tag{9}$$

where $n_{zf}$ is the sample count of consumers in fleet
category $f$ that buy an auto of any kind, and $\bar{z}'_f = n_{zf}/n_f$
is the sample proportion of consumers in category $f$ who
buy an auto of any kind.


4.2 Efficiency of the Model-Based Estimator of $Y'_k$


To investigate the gain in efficiency of $\hat{\theta}'_k(2)$ over
$\hat{\theta}'_k(1)$ we consider the efficiency of the model-based
estimator, defined by

$$e[\hat{\theta}'_k(2)] = \frac{V_\xi(\hat{\theta}'_k(1)) - V_\xi(\hat{\theta}'_k(2))}{V_\xi(\hat{\theta}'_k(1))}, \tag{10}$$

for various population structures in which assumption (5)
holds.

We consider a population with strata $\{h\}$, stratum sizes
$\{N_h\}$ and sample allocations $\{n_h\}$ as given in Table 1,
and a conditioning factor with ten categories $f$ ($f = 1,
\ldots, 10$) of increasing fleet size. We compute the efficiency
factor $e[\hat{\theta}'_k(2)]$ for various combinations of parameter
values of $\{Q_{f|h}\}$ and $\{P_{k|f}\}$.


We consider five different structures for $\{Q_{f|h}\}$:

(a) $Q_{f|h} = 1$ for $f = h$ and 0 otherwise, $h = 1, \ldots, 10$.
(b) $Q_{f|h}$ = Band Matrix (0.025, 0.95, 0.025); that is,
    $Q_{f|h} = 0.95$ for $f = h$, $0.025$ for $f = h - 1$ and
    $f = h + 1$, and 0 otherwise.
(c) $Q_{f|h}$ = Band Matrix (0.05, 0.90, 0.05).
(d) $Q_{f|h}$ = Band Matrix (0.05, 0.10, 0.70, 0.10, 0.05).
(e) $Q_{f|h} = 0.1$ for $h = 1, \ldots, 10$ and $f = 1, \ldots, 10$.

We consider four different structures for $\{P_{k|f}\}$:

(i) $P_{k|f} = 0.1$ for $f = 1, 2$, and 0 otherwise.
(ii) $P_{k|f}$ a small probability that decreases with $f$, $f = 1, \ldots, 10$.
(iii) $P_{k|f}$ a probability that increases with $f$, $f = 1, \ldots, 10$.
(iv) $P_{k|f} = 0.5$ for $f = 1, \ldots, 10$.


Structure (a) is one where the categorization $f$ coincides
with the stratification. In structures (b), (c) and (d), in any
particular stratum $h$ the majority of consumers fall into
one fleet category ($f = h$) with a few consumers in
neighbouring categories (e.g., for (b) and (c) $f = h - 1$,
$h + 1$). Finally, structure (e) implies that, in any stratum
$h$, consumers will be equally likely to fall into any one of
the fleet categories $f = 1, \ldots, 10$.

Structure (i) for $P_{k|f}$ implies a type of auto that is
purchased with a small probability by consumers with
small fleet sizes (i.e. that fall in categories $f = 1$ or 2),
but not purchased by consumers with larger fleet sizes.
Structure (ii) suggests a type of auto purchased with small
probability which decreases as fleet size increases, whilst
structure (iii) implies the reverse. In structure (iv) a popular
model is bought with probability 0.5 regardless of the
consumer's fleet size.
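The band-matrix structures for $Q_{f|h}$ can be generated as in the sketch below. How the first and last strata are treated, where part of the band falls outside the categories 1 to 10, cannot be read from the text; renormalizing those rows so that they sum to one is our assumption, as is the function name.

```python
def band_matrix(weights, size=10):
    """Build Q[h][f] with the given weights centred on f = h.

    Rows near the edges lose part of the band; they are renormalized
    to sum to one (an assumption, not taken from the paper).
    """
    half = len(weights) // 2
    rows = []
    for h in range(size):
        row = [0.0] * size
        for j, w in enumerate(weights):
            f = h + j - half
            if 0 <= f < size:
                row[f] = w
        total = sum(row)
        rows.append([w / total for w in row])
    return rows


Q_b = band_matrix([0.025, 0.95, 0.025])                # structure (b)
Q_d = band_matrix([0.05, 0.10, 0.70, 0.10, 0.05])      # structure (d)
```

Interior rows reproduce the stated weights exactly; only the edge rows depend on the renormalization choice.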

Table 3 gives the efficiency factor defined in (10) for
each combination of structures for $Q_{f|h}$ and $P_{k|f}$ under the
disproportionate allocation given in Table 1. Column (a)
of the table is the special case where the stratification and
the categorization $f$ coincide, and the two estimators
$\hat{\theta}'_k(1)$ and $\hat{\theta}'_k(2)$ are the same. The table shows
that large gains in efficiency (e.g., 70%) can be attained for
certain parameter combinations: the weaker the association
between $f$ and $h$ the greater the efficiency gain. Even for
structures (c) and (d) where the association between $f$ and
$h$ is strong, substantial efficiency gains can be achieved.
The structure $Q_{f|h}$ is much more important than $P_{k|f}$ in
determining efficiency gain.


Table 3
Efficiency Factors, $e[\hat{\theta}'_k(2)]$, for Various Combinations
of $Q_{f|h}$ and $P_{k|f}$

Structure                   Structure for Q_{f|h}
for P_{k|f}     (a)     (b)           (c)           (d)           (e)

(i)             0       [illegible]   [illegible]   0.555         0.648
(ii)            0       0.116         0.206         0.391         0.695
(iii)           0       0.103         0.181         [illegible]   0.625
(iv)            0       0.115         0.203         0.391         0.706

In the special case (e) where $Q_{f|h}$ is a constant for all
$f$ and $h$ it can be shown that the efficiency factor can be
expressed as

$$e[\hat{\theta}'_k(2)] = \left( 1 - \frac{\sigma_P^2}{\bar{P}_k (1 - \bar{P}_k)} \right)
 \frac{\sum_h \tau_h N_h^2/n_h}{\sum_h N_h^2/n_h}, \tag{11}$$

where

$$\bar{P}_k = \frac{1}{F} \sum_f P_{k|f} \quad \text{and} \quad
 \sigma_P^2 = \frac{1}{F} \sum_f (P_{k|f} - \bar{P}_k)^2$$

are the mean and variance of $\{P_{k|f}\}$ over the categories
$f$, and $\tau_h = 1 - n_h/n + O(n^{-1})$. The term in
parentheses in (11) lies between 0 and 1 and its value depends
on how the $\{P_{k|f}\}$ vary over the categories $f$. In case (iv)
$P_{k|f}$ is constant and so this term is unity. The second term
of (11) depends solely on the design, and its value for the
sample allocation specified in Table 1 is 0.706.
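The design-dependent second factor of (11) can be checked numerically. The sketch below keeps only the leading part of $\tau_h$, namely $1 - n_h/n$, and ignores the $O(n^{-1})$ remainder, so it should agree with the quoted 0.706 only approximately. The stratum and sample sizes are read from Table 1; a few of those cells are hard to read in the source and the values used for them here are assumptions.

```python
# Stratum sizes N_h and sample sizes n_h in the row order of Table 1.
N = [389445, 1007399, 6646, 6826, 992, 1110, 8703, 7625, 1133, 1523]
n = [1150, 7406, 235, 1113, 520, 849, 472, 1437, 484, 1117]

n_total = sum(n)
# Leading-order tau_h = 1 - n_h / n; the O(1/n) remainder is dropped.
numerator = sum((1 - n_h / n_total) * N_h ** 2 / n_h for N_h, n_h in zip(N, n))
denominator = sum(N_h ** 2 / n_h for N_h, n_h in zip(N, n))
design_factor = numerator / denominator
```

The factor is driven almost entirely by the two British Telecom strata, whose values of $N_h^2/n_h$ dwarf those of the Dun and Bradstreet strata.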


4.3 Estimating $Y_k$: the Number of Autos Purchased
of Product Type k


The previous approach in Section 4.1 may be extended
to the number of purchases. We introduce a further
conditioning factor which represents the total number of
autos purchased, $m$, regardless of product type, and we
extend the notation in the obvious manner to $Y_{khfmi}$, the
random variable representing the number of autos of
product type $k$ purchased by consumer $i$ in stratum $h$, fleet
size $f$, and buying $m$ autos of any kind. The idea is that
the number of purchases of product $k$ is likely to vary
depending on the total number of autos purchased. Let

$S_{m|hf}$ = Prob{consumer buys $m$ autos of any kind | $h, f$},
$m = 0, 1, 2, \ldots$,

$T_{\ell|hfm}$ = Prob{consumer buys $\ell$ autos of type $k$ | $h, f, m$},
$\ell = 0, 1, \ldots, m$.

The model-based target parameter, equivalent to the
total purchases of product $k$, $Y_k$, is extended from (4) and
may now be expressed as

$$\theta_k = \sum_h \sum_f \sum_m \sum_\ell \ell \, N_h Q_{f|h} S_{m|hf} T_{\ell|hfm}. \tag{12}$$

We consider two sets of additional assumptions, the first
of which is

$$T_{\ell|hfm} = T_{\ell|fm} \quad \text{for all } h. \tag{13}$$

These assumptions imply that conditional on fleet size
category, $f$, and the total number of new autos purchased,
$m$, the distribution of the number of autos purchased of
product type $k$ is independent of stratum $h$.


The maximum likelihood estimator of $\theta_k$ under
assumptions (13) is

$$\hat{\theta}_k(2) = \sum_f \sum_m \hat{N}_{fm} \bar{y}_{kfm}, \tag{14}$$

where $\hat{N}_{fm} = \sum_h N_h n_{hfm}/n_h$ and
$\bar{y}_{kfm} = \sum_\ell \ell \, n_{fm\ell}/n_{fm}$
is the unweighted sample mean of the number of autos
of product type $k$ purchased by consumers of fleet size $f$
that purchased a total of $m$ autos of any kind.

The selection probabilities are used here to provide a
weighted estimator of $N_{fm}$, the total number of
consumers of fleet size $f$ that buy $m$ cars of any kind. The form
of the estimator is analogous to that in equation (6). Under
the model assumption (13) it may be shown that

$$\begin{aligned}
V_\xi(\hat{\theta}_k(2)) ={}& \sum_h \sum_f \sum_m \frac{N_h^2}{n_h} \mu_{fm}^2 Q_{fm|h}(1 - Q_{fm|h})
 - \sum_h \sum_{(f,m) \ne (f',m')} \frac{N_h^2}{n_h} \mu_{fm} \mu_{f'm'} Q_{fm|h} Q_{f'm'|h} \\
&+ \sum_h \sum_f \sum_m \frac{N_h^2 \sigma_{fm}^2 Q_{fm|h}}{n_h \sum_{h'} n_{h'} Q_{fm|h'}}
 \left\{ (1 - Q_{fm|h}) + n_h Q_{fm|h}
 - \frac{1 + (2n_h - 3) Q_{fm|h} - 2(n_h - 1) Q_{fm|h}^2}{\sum_{h'} n_{h'} Q_{fm|h'}} \right\}
\end{aligned} \tag{15}$$

where $Q_{fm|h} = Q_{f|h} S_{m|hf}$, $\mu_{fm} = E_\xi\{Y_{khfmi}\}$, and
$\sigma_{fm}^2 = V_\xi\{Y_{khfmi}\}$.

In practice, $\bar{y}_{kfm}$ will be based on very few observations
if few customers in fleet size category $f$ purchase exactly
$m$ cars. For more stability $m$ may be defined as an ordinal
variable by grouping the total number of autos purchased
into a small number of categories. In this case assumption
(13) implies that the distribution of purchases for product
type $k$ is the same within fleet size category $f$ and total
purchase category $m$. Also, $\ell$ may be treated as a
continuous random variable and distributional assumptions
made about $\ell$ leading to ratio or regression estimators.

A second and even stronger set of parameter
assumptions is

$$T_{\ell|hfm} = T_{\ell|fm} \quad \text{for all } h, \qquad
 S_{m|hf} = S_{m|f} \quad \text{for all } h. \tag{16}$$


These assumptions imply that conditional on fleet size
category, $f$, the joint distribution of the number of autos
purchased of type $k$ and the total number of autos
purchased of any kind, $m$, is independent of the stratum $h$.
In this case the maximum likelihood estimator of $\theta_k$ is
given by

$$\hat{\theta}_k(3) = \sum_f \hat{N}_f \bar{y}_{kf}, \tag{17}$$

where $\bar{y}_{kf} = \sum_\ell \ell \, n_{f\ell}/n_f$ is the unweighted sample mean
of the number of autos of product type $k$ purchased by
consumers in fleet size $f$ regardless of how many autos the
consumer bought in total, and $\hat{N}_f = \sum_h N_h n_{hf}/n_h$ is a
weighted estimator of the number of consumers of fleet
size $f$ overall. It may be shown that under assumptions (16)


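$$\begin{aligned}
V_\xi(\hat{\theta}_k(3)) ={}& \sum_h \sum_f \frac{N_h^2}{n_h} \mu_f^2 Q_{f|h}(1 - Q_{f|h})
 - \sum_h \mathop{\sum\sum}_{f \ne f'} \frac{N_h^2}{n_h} \mu_f \mu_{f'} Q_{f|h} Q_{f'|h} \\
&+ \sum_h \sum_f \frac{N_h^2 \sigma_f^2 Q_{f|h}}{n_h \sum_{h'} n_{h'} Q_{f|h'}}
 \left\{ (1 - Q_{f|h}) + n_h Q_{f|h}
 - \frac{1 + (2n_h - 3) Q_{f|h} - 2(n_h - 1) Q_{f|h}^2}{\sum_{h'} n_{h'} Q_{f|h'}} \right\}.
\end{aligned} \tag{18}$$

If assumptions (16) were plausible then $\bar{y}_{kf}$ would be
based on larger sample sizes than $\bar{y}_{kfm}$ in (14) and hence
$\hat{\theta}_k(3)$ would be more stable.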


The maximum likelihood estimator of the market share
for product type $k$, $R_k$, under assumption (16), is given by

$$\hat{R}_k(3) = \frac{\sum_f \hat{N}_f \bar{y}_{kf}}{\sum_f \hat{N}_f \bar{z}_f}, \tag{19}$$

where $\bar{z}_f$, defined analogously to $\bar{y}_{kf}$, is the unweighted
sample mean number of autos of any kind purchased by
consumers in fleet category $f$.
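The estimators in (17) and (19) follow the same pattern as (6): weighted estimates of the category sizes combined with unweighted category means. A minimal sketch under assumptions (16), with hypothetical counts and a function name of our own choosing:

```python
def autos_estimates(N, n_hf, y_sum_f, z_sum_f):
    """Equations (17) and (19): total autos of type k and market share.

    N[h]        stratum size N_h
    n_hf[h][f]  sample count in stratum h and category f
    y_sum_f[f]  autos of type k bought by sampled consumers in category f
    z_sum_f[f]  autos of any kind bought by sampled consumers in category f
    """
    H, F = len(N), len(y_sum_f)
    n_h = [sum(row) for row in n_hf]
    total_k = 0.0
    total_all = 0.0
    for f in range(F):
        n_f = sum(n_hf[h][f] for h in range(H))
        N_hat_f = sum(N[h] * n_hf[h][f] / n_h[h] for h in range(H))
        total_k += N_hat_f * y_sum_f[f] / n_f       # N_hat_f * ybar_kf
        total_all += N_hat_f * z_sum_f[f] / n_f     # N_hat_f * zbar_f
    return total_k, total_k / total_all


N = [1000, 500]
n_hf = [[60, 40], [10, 40]]
theta_3, share_3 = autos_estimates(N, n_hf, y_sum_f=[14, 40], z_sum_f=[35, 160])
```

The numerator and denominator of the share use the same $\hat{N}_f$, so the ratio depends on the weights only through the estimated category sizes.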


5. EMPIRICAL RESULTS 


5.1 Estimating Consumers 


In Section 4.2 the efficiency of $\hat{\theta}'_k(2)$ was investigated
for various population structures when assumption (5)
held. Readers may find this measure unconvincing since
(5) will not hold in practice. We now use the actual survey
data to compute $\hat{\theta}'_k(2)$ for a particular categorization of
the conditioning factor that is defined by a combination
of the fleet size and whether or not the consumer
purchased any autos of any kind for fleet use (see Table 4).
Empirical evaluations of synthetic estimators have been
carried out by Schaible, Brock and Schnack (1977) and
Drew, Singh and Choudhry (1982) in different contexts.

For each of the products A-G listed in Table 2 a $\chi^2$ test
was used to test the hypothesis that, conditional on the
category of the conditioning factor ($f$), whether or not
a consumer purchases that product is independent of
stratum ($h$). Note that for our example the design is
stratified random sampling and standard multinomial
assumptions apply. For multistage designs, the standard
$\chi^2$ analysis would have to be adjusted by using Rao-Scott
adjustments for example. In practice it is difficult to find
a categorization $f$ such that conditional independence
assumptions (5) hold for every product type. However, for
the categorization defined in Table 4 it was found that
most of the variability in the probability of purchasing a
particular product type was explained by the category $f$ of
the conditioning factor and very little of the residual
variation was due to differences in strata.


Table 4
Definition of the Categories, f, of the
Conditioning Factor

Category        Definition of f
   f       Fleet Size    Fleet Purchases

   1       Any            0
   2       1-4           >0
   3       5-8           >0
   4       9-15          >0
   5       16-25         >0
   6       26-50         >0
   7       51-100        >0
   8       101-200       >0
   9       201-550       >0
  10       550+          >0
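For a single category $f$, the test of independence between purchase of product $k$ and stratum $h$ is an ordinary Pearson test on a strata-by-purchase table of sample counts. The sketch below computes the statistic and its degrees of freedom from scratch; the counts are hypothetical and the p-value lookup is left out.

```python
def pearson_chi_square(table):
    """Pearson X^2 statistic and degrees of freedom for a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts within one category f: rows are strata h,
# columns are (bought product k, did not buy product k).
table_f = [[10, 90], [20, 180], [5, 45]]
stat, dof = pearson_chi_square(table_f)
```

In this example every stratum has the same purchase rate, so the statistic is zero; a large value relative to the chi-square distribution with `dof` degrees of freedom would be evidence against assumption (5) for that category.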

The model-based estimates for consumers, $\hat{\theta}'_k(2)$ and
$\hat{R}'_k(2)$, obtained from (6) and (9) respectively, are given
in Table 5. The model-based variances may give an
optimistic view of the precision of the estimators since they
depend on the conditional independence assumptions in
the model which may be untrue in practice. Alternatively
the usual survey estimate of the p-based variance of the
model-based estimator may be derived (see Holt and
Holmes 1993). This requires no distributional or
conditional independence assumptions of any kind and might
be considered a more objective measure. These estimates
of standard errors are given in Table 5. Since the estimated
standard errors are design-based, they include finite
population corrections. [We note here that the model-based
standard errors for $\hat{\theta}'_k(2)$ (not shown in Table 5) were
consistently around 10% smaller than the p-based standard
errors].


Table 5
Model-Based Estimates with p-Based Standard Errors
for Selected Products

              Estimating Consumers        Estimating Autos
Product (k)   Total      Penetration      Total       Share
              θ̂'_k(2)    R̂'_k(2)          θ̂_k(3)      R̂_k(3)

A             63,433     .4070            263,511     .3722
              (2,230)    (.0105)          (13,007)    (.0048)

B             39,673     .2546            177,067     .2501
              (1,587)    (.0086)          (9,530)     (.0046)

C             21,930     .1407            65,357      .0923
              (1,142)    (.0066)          (3,836)     (.0027)

D             13,422     .0861            22,146      .0313
              (868)      (.0052)          (1,535)     (.0016)

E             7,366      .0473            15,798      .0223
              (675)      (.0041)          (1,223)     (.0014)

F             5,826      .0374            14,398      .0203
              (492)      (.0031)          (1,113)     (.0012)

G             7,686      .0493            11,207      .0158
              (633)      (.0039)          (813)       (.0011)

Row 1: estimate   Row 2: p-based s.e.


Comparing these results with the usual survey results 
given in Table 2 we find that the standard errors for esti- 
mating totals are considerably smaller - around 30-40% 
smaller for all products except A and B (the major 
manufacturers) where the reduction is about 15-20%. This 
pattern is expected since the original survey design was 
optimal for the total sales of autos and therefore relatively 


efficient for products with a large market share. We expect 
the products with smaller market shares to benefit most 
from the model-based approach. 

For estimating market penetration the reduction in 
standard error is again about 30-40% with slightly smaller 
reductions for products A and B. 
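The size of these reductions can be recomputed directly from the standard errors for the total number of consumers in Tables 2 and 5. The sketch below does so for each product; the dictionary and variable names are ours.

```python
# Standard errors for the total number of consumers, from Tables 2 and 5.
se_direct = {"A": 2651, "B": 1960, "C": 1602, "D": 1311,
             "E": 1146, "F": 676, "G": 1015}
se_model = {"A": 2230, "B": 1587, "C": 1142, "D": 868,
            "E": 675, "F": 492, "G": 633}

# Percentage reduction in standard error, product by product.
reduction = {
    k: 100 * (1 - se_model[k] / se_direct[k]) for k in se_direct
}
```

The same calculation applied to the penetration or autos columns only requires swapping in the corresponding standard errors.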


5.2 Estimating Autos 


Table 5 also contains model-based estimates for the
total number of autos purchased of type $k$ and the
corresponding market share, $\hat{\theta}_k(3)$ and $\hat{R}_k(3)$ as defined by
(17) and (19) respectively, for the same categorization $f$
of the conditioning factor as given in Table 4. P-based
standard errors for these estimates are also presented in
Table 5.

Comparing with the standard survey estimates given in 
Table 2 large reductions in standard errors for estimating 
totals are obtained (40-80%) apart from product type B. 
Similarly, for estimating the market shares the reduction 
in standard error is again substantial. 


6. DISCUSSION 


The model-based estimators are derived using
conditional independence assumptions to partition the
estimation problem into two components. The first, an estimate
of $N_f$ (the number of consumers of fleet size $f$), makes use
of the unequal selection probabilities, whereas the second,
an estimate of the proportion of consumers of fleet size
$f$ buying product type $k$ (or the average number of autos
of product type $k$ purchased by consumers of fleet size $f$)
does not. This can result in a substantial efficiency gain.

If the conditional independence assumptions are invalid 
then in ordinary design-based terms the estimators will 
have a residual bias but this may be an acceptable risk to 
achieve stability of the estimators over the whole product 
range. For the numerical results in previous sections, only 
the model-based estimates for product B are outside of 
the 95% confidence interval based on the direct survey 
estimator. The conditional independence assumptions will
depend on the choice of the categories $f$, and can be tested
using chi-square tests for contingency tables.

Whilst the results in Table 5 show that the design-based 
standard errors for the model-based estimates are gener- 
ally smaller than for the direct estimates shown in Table 2, 
it may be argued that the model-based estimators may be 
biased and hence provide no gain in terms of mean
squared error (MSE). The bias will arise from the
inappropriateness of the conditional independence assumptions
(e.g., equation (5)). This is not testable, but a comparison
of Tables 2 and 5 can give some insight into the size of bias
that would be required to cause the MSE to be the same
for both the direct and the model-based estimators.
Consider the estimate of total consumers for product E which
is strongly affected by the procedure and hence perhaps
most susceptible to bias. The variance (and hence MSE) of
the direct estimator is $1{,}146^2 = 1{,}313{,}316$ whereas for the
model-based estimator the variance is $675^2 = 455{,}625$.
Hence, the model-based estimate of 7,366 would need a
bias of 926 in order for the MSEs to be the same.
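The break-even bias quoted above follows from equating the two mean squared errors, with the direct estimator treated as unbiased. A short check (the function name is ours):

```python
import math


def break_even_bias(se_direct, se_model):
    """Bias at which se_model^2 + bias^2 equals se_direct^2."""
    return math.sqrt(se_direct ** 2 - se_model ** 2)


# Product E, total consumers: standard errors from Tables 2 and 5.
bias_E = break_even_bias(1146, 675)
```

Rounded to the nearest integer this gives the figure of 926 used in the text.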
Time Series EBLUPs for Small Areas Using Survey Data


A.C. SINGH, H.J. MANTEL and B.W. THOMAS¹


ABSTRACT 


In estimation for small areas it is common to borrow strength from other small areas since the direct survey estimates 
often have large sampling variability. A class of methods called composite estimation addresses the problem by 
using a linear combination of direct and synthetic estimators. The synthetic component is based on a model which 
connects small area means cross-sectionally (over areas) and/or over time. A cross-sectional empirical best linear 
unbiased predictor (EBLUP) is a composite estimator based on a linear regression model with small area effects. 
In this paper we consider three models to generalize the cross-sectional EBLUP to use data from more than one 
time point. In the first model, regression parameters are random and serially dependent but the small area effects 
are assumed to be independent over time. In the second model, regression parameters are nonrandom and may take 
common values over time but the small area effects are serially dependent. The third model is more general in that 
regression parameters and small area effects are assumed to be serially dependent. The resulting estimators, as well 
as some cross-sectional estimators, are evaluated using bi-annual data from Statistics Canada's National Farm Survey
and January Farm Survey.


KEY WORDS: Composite estimation; State space models; Kalman filter; Fay-Herriot estimator. 


1. INTRODUCTION 


There exists a considerable body of research on small 
area estimation using cross-sectional survey data in con- 
junction with supplementary data obtained from census 
and administrative sources. A good collection of papers 
on this topic can be found in Platek, Rao, Särndal and
Singh (1987). Small area estimation techniques in use in 
U.S. federal statistical programs are reviewed by the 
Federal Committee on Statistical Methodology (1993). 
The basic idea underlying all small area methods is to 
borrow strength from other areas by assuming that differ- 
ent areas are linked via a model containing auxiliary 
variables from the supplementary data. It would also be
important to borrow strength across time because many 
surveys are repeated over time. Recently time series 
methods have been employed to develop improved esti- 
mators for small areas; see Pfeffermann and Burck (1990) 
and Rao and Yu (1992). It is interesting to note that after 
the initiative of Scott and Smith (1974) on the application 
of time series methods to survey data, there has only lately 
been a resurgence of interest in developing suitable estimates 
of aggregates from complex surveys repeated at regular 
time intervals; see e.g., Bell and Hillmer (1987), Binder 
and Dick (1989), Pfeffermann (1991), and Tiller (1992). 

In this paper we consider some natural generalizations 
of the best linear unbiased predictor (BLUP) for small 
areas when a time series of direct small area estimates is 
available. An important example of the BLUP for small 
areas is the Fay-Herriot (FH) estimator, which entails 
smoothing of direct estimators by cross-sectional modelling
of small area totals. The resulting estimators are composite
estimators (i.e., convex combinations of direct and syn- 
thetic estimators) and are called empirical BLUPs, or 
EBLUPs, whenever estimates of some variance compo- 
nents are substituted in the BLUPs. The work of Fay and 
Herriot (1979) represents an important milestone in the 
field of small area estimation because it is probably the 
first example of a large scale application of small area 
estimation by government agencies for policy analysis. 
With the use of structural models, we derive time series 
EBLUPs which combine both cross-sectional and time 
series data. The models underlying the time series EBLUPs 
were chosen on the basis of general heuristic considera- 
tions rather than formal model testing procedures. Formal 
testing of these types of models with survey data is very 
difficult and not very much is available. Instead, we begin 
with a regression model that is reasonable for the larger 
area, and then allow random small area effects to account 
for any local deviations from the global model. The regres- 
sion parameters and random small area effects are allowed 
to evolve over time according to a state space model that 
was also formulated heuristically. We have not considered 
here the problem of mean squared error (MSE) estimation 
for our estimators. MSEs with respect to the motivating 
models could be defined and estimated for many of the 
estimators; however, the focus of this paper is on the 
performance of the estimators in a repeated sampling 
framework. MSE estimation is an important and difficult 
problem, and the availability of reliable MSE estimators 
could be an important consideration in the choice of 
estimators. 


1 A.C. Singh and H.J. Mantel, Social Survey Methods Division; B.W. Thomas, Business Survey Methods Division, Statistics Canada, Ottawa,

The main purpose of this paper is to compare time series
EBLUPs with cross-sectional estimators such as post- 
stratified domain, synthetic, FH and sample size dependent 
estimators. In the time series modelling of the direct small 
area estimates we assume that the survey errors are uncorre- 
lated over time. When survey errors are correlated over time 
and can be modelled reasonably (e.g., ARMA) the approach 
of Pfeffermann (1991) can be used to obtain time series 
EBLUPs via the Kalman filter. Rao and Yu (1992) obtain 
EBLUPs for a model, in which the Kalman filter cannot 
be applied, with survey errors having arbitrary correlation 
structure over time but being uncorrelated across areas. They 
also develop second order approximations to, and estimation 
of, the mean squared error under their model. When a model 
for the correlated survey errors is difficult to specify it may 
be possible, using a suitably modified Kalman filter, to get 
good sub-optimal estimators (Singh and Mantel 1991). 

In this paper we report on an empirical study of the effi- 
ciency of time series EBLUPs. The study uses Monte Carlo 
simulations from real time series data obtained from 
Statistics Canada’s biannual farm surveys. The main 
findings of the study are 

(i) There can be reasonable gains in efficiency with time 
series EBLUPs over cross-sectional estimators. 

(ii) Within the class of time series methods considered in 
this paper, introduction of serial dependence in the 
random small area effects is found to be beneficial. 

(iii) Although any smoothed version of the direct small 
area estimator is expected to be biased, the time 
series EBLUPs exhibit less bias than cross-sectional 
smoothing methods. 


Section 2 contains a description of various cross- 
sectional methods for small area estimation. Time series 
EBLUPs are described in Section 3 and the details and 
results of the Monte Carlo comparative study are given in 
Section 4. Finally, Section 5 contains concluding remarks. 


2. METHODS BASED ON CROSS- 
SECTIONAL DATA 


In this section we describe some well known small area 
estimation methods that use survey data from only the 
current time. Ghosh and Rao (1994) contains a good 
survey of various small area estimators. 

Let $\theta$ denote the vector of small area population totals
$\theta_k$, $k = 1, \ldots, K$. In this section, which deals with
methods based on cross-sectional data, we ignore the
dependence of $\theta$ on time $t$ for simplicity.


2.1 Method 1 (Expansion Estimator for Domains) 


This estimator is given by

$$g_{1k} = \sum_{j \in s_k} d_j y_j,$$

where $d_j$ is the survey weight for sample unit $j$ and $s_k$ is the set of
sample units in small area $k$. For stratified simple random sampling, which
is used for our simulation study in Section 4, we have

$$g_{1k} = \sum_h (N_h/n_h) \sum_{j \in s_{hk}} y_{hj}, \tag{2.1}$$

where $y_{hj}$ is the $j$-th observation in the $h$-th stratum, $s_{hk}$
denotes the set of $n_{hk}$ sample units falling in the $k$-th small
area in the $h$-th stratum and $n_h$, $N_h$ denote respectively
the sample and population sizes for the $h$-th stratum. This
estimator is often unreliable because $n_{hk}$, the random
sample size in the small area, may be small in expectation
and could have high variability. Conditional on the realized
sample size $n_{hk}$, $g_{1k}$ is biased. However, unconditionally,
it is unbiased for $\theta_k$.


2.2 Method 2 (Post-stratified Domain Estimator) 


We will also refer to this estimator as the direct small
area estimator. If the population size $N_{lk}$ is known for
some post-strata indexed by $l$, then the efficiency of the
estimator $g_{1k}$ could be improved by post-stratification.
We define

$$g_{2k} = \sum_l N_{lk} \Big( \sum_{j \in s_{lk}} d_j y_j \Big/ \sum_{j \in s_{lk}} d_j \Big).$$

In our simulations our post-strata are the intersections of
design strata with small areas which leads to

$$g_{2k} = \sum_h (N_{hk}/n_{hk}) \sum_{j \in s_{hk}} y_{hj} = \sum_h N_{hk} \bar{y}_{hk}. \tag{2.2}$$

This estimator also may not be sufficiently reliable because
of the possibility of the $n_{hk}$'s being small in expectation. If
$n_{hk} = 0$, the above estimator is not defined. It is conventional
to replace $\bar{y}_{hk}$ by 0 when $n_{hk} = 0$. In the empirical
study presented in this paper, we replaced $\bar{y}_{hk}$ by the
synthetic estimate $(X_{hk}/X_h)\bar{y}_h$, where $X$ is a suitable
covariable, whenever $n_{hk} = 0$.

The estimator $g_{2k}$ in (2.2) is conditionally (given
$n_{hk} > 0$) unbiased and approximately unconditionally
unbiased. Appendix A.1 gives details of estimation of the
conditional mean squared error, $v_k$, of $g_{2k}$.
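As an illustration of Methods 1 and 2 under stratified simple random sampling, the following minimal Python sketch computes the expansion estimator (2.1) and the post-stratified estimator (2.2) for one small area. The function names, the tuple layout of `sample` and the toy numbers are assumptions made for this example only; they do not come from the farm survey data.

```python
from collections import defaultdict


def expansion_estimate(sample, N_h, n_h, area):
    """Expansion estimator g_1k of (2.1) under stratified SRS.

    sample: list of (stratum, area, y) tuples for the sampled units.
    N_h, n_h: dicts of stratum population and sample sizes.
    """
    return sum(N_h[h] / n_h[h] * y for h, k, y in sample if k == area)


def poststratified_estimate(sample, N_hk, area, fallback_mean=None):
    """Post-stratified estimator g_2k of (2.2); post-strata are stratum x area cells.

    N_hk: dict mapping (stratum, area) to the known cell population size.
    fallback_mean: optional dict of cell means used when a cell has no
    sampled units; without it an empty cell contributes 0.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for h, k, y in sample:
        if k == area:
            sums[h] += y
            counts[h] += 1
    total = 0.0
    for (h, k), size in N_hk.items():
        if k != area:
            continue
        if counts[h] > 0:
            cell_mean = sums[h] / counts[h]
        elif fallback_mean is not None:
            cell_mean = fallback_mean[(h, k)]
        else:
            cell_mean = 0.0
        total += size * cell_mean
    return total


# Hypothetical toy data: two strata, two small areas.
sample = [(1, "A", 10.0), (1, "A", 20.0), (1, "B", 30.0), (2, "A", 100.0)]
N_h = {1: 30, 2: 4}
n_h = {1: 3, 2: 1}
N_hk = {(1, "A"): 18, (1, "B"): 12, (2, "A"): 3, (2, "B"): 1}

g1_A = expansion_estimate(sample, N_h, n_h, "A")
g2_A = poststratified_estimate(sample, N_hk, "A")
g2_B = poststratified_estimate(sample, N_hk, "B", fallback_mean={(2, "B"): 90.0})
```

In the toy data the cell (stratum 2, area B) has no sampled units, so `g2_B` relies on the supplied fallback mean, which plays the role of the synthetic replacement described above.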


2.3. Method 3 (Synthetic Estimator) 


It is possible to define a more efficient estimator by 
assuming a model which allows for ‘‘borrowing strength’’ 
from other small areas. This gives rise to synthetic 
estimators, see e.g., Gonzalez (1973) and Ericksen (1974). 
Suppose different small area totals are connected via the
auxiliary variable $X_k$ by a linear model as

$$\theta_k = \beta_1 + \beta_2 X_k, \quad k = 1, \ldots, K; \tag{2.3a}$$

or in matrix notation

$$\theta = F\beta, \tag{2.3b}$$

where $F = (F_1, F_2, \ldots, F_K)'$, $F_k = (1, X_k)'$. Now consider
a model for the direct small area estimators $g_{2k}$'s as

$$g_2 = F\beta + \varepsilon,$$

where $g_2 = (g_{21}, \ldots, g_{2K})'$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_K)'$, and the $\varepsilon_k$'s are
uncorrelated survey errors with mean 0 and variance $v_k$.
Note that the $g_{2k}$'s are uncorrelated over areas since they
are conditionally (given $n_{hk}$) unbiased and the samples in
different small areas are conditionally independent.

Denoting by $\hat\beta$ the weighted least squares (WLS) estimate
of $\beta$, we obtain the regression-synthetic estimator of
$\theta$ under the assumed model as

$$g_3 = F\hat\beta.$$

The above estimator could be heavily biased unless the
model (2.3) is satisfied reasonably well. The above model
may not be realistic because no random fluctuation or
random small area effect ($a_k$, say) is allowed.
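The regression-synthetic estimator can be sketched in a few lines of Python. This is only an illustration under the assumptions of model (2.3), with one auxiliary variable and known survey error variances; the function name `wls_synthetic` and the numbers are hypothetical.

```python
def wls_synthetic(g2, x, v):
    """Weighted least squares fit of g2_k = beta1 + beta2 * x_k + eps_k,
    with Var(eps_k) = v_k, returning (beta1, beta2, g3) where
    g3_k = beta1 + beta2 * x_k is the regression-synthetic estimate."""
    wt = [1.0 / vk for vk in v]  # WLS weights are inverse variances
    sw = sum(wt)
    sx = sum(a * xk for a, xk in zip(wt, x))
    sy = sum(a * yk for a, yk in zip(wt, g2))
    sxx = sum(a * xk * xk for a, xk in zip(wt, x))
    sxy = sum(a * xk * yk for a, xk, yk in zip(wt, x, g2))
    # Closed-form solution of the 2 x 2 normal equations.
    beta2 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    beta1 = (sy - beta2 * sx) / sw
    g3 = [beta1 + beta2 * xk for xk in x]
    return beta1, beta2, g3


# Hypothetical direct estimates lying exactly on the line 2 + 3x.
x_demo = [1.0, 2.0, 3.0, 4.0]
g2_demo = [5.0, 8.0, 11.0, 14.0]
v_demo = [1.0, 2.0, 1.0, 4.0]
b1, b2, g3_demo = wls_synthetic(g2_demo, x_demo, v_demo)
```

When the direct estimates satisfy the linear model exactly, as in this toy input, the synthetic estimates reproduce them; with real data every area is pulled onto the fitted line, which is the source of the bias noted above.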


2.4 Method 4 (Fay-Herriot Estimator or EBLUP) 


Using the empirical Bayes approach of Fay and Herriot 
(1979) or the more general best linear unbiased predictor 
approach (see e.g., Battese, Harter and Fuller 1988, and 
Pfeffermann and Barnard 1991), the bias of the synthetic 
estimator can be reduced considerably by using a composite 
estimator; for an early reference on composite estimation 
see Schaible (1978). The composite estimator is obtained 
as a convex combination of $g_2$ and a modified $g_3$. For this
purpose, it is assumed that

$$\theta = F\beta + a, \tag{2.4}$$

where the $a_k$'s are uncorrelated random small area effects
with mean 0 and variance $w_k$ known up to a constant.
In our empirical study later we take $w_k = w$. Thus we
model $g_2$ as

$$g_2 = F\beta + a + \varepsilon. \tag{2.5}$$

Here $a$ is also assumed to be uncorrelated with $\varepsilon$. The
BLUP of $\theta$ under the model defined by (2.4) and (2.5) is

$$g_4 = g_3^* + \Lambda(g_2 - g_3^*) = \Lambda g_2 + (I - \Lambda) g_3^*, \tag{2.6}$$

where $\Lambda = W U^{-1}$, $U = V + W$,
$V = \mathrm{diag}(v_1, \ldots, v_K)$, $W = \mathrm{diag}(w_1, \ldots, w_K)$,
and $g_3^* = F\beta^*$, where $\beta^*$ is the WLS estimate of $\beta$ under model
(2.5). Here it is assumed that both the covariance matrices
$V$ and $W$ are known in computing the BLUP.

The expression (2.6) follows from the general results
on linear models with random effects, see e.g., Rao
(1973, p. 267) and Harville (1976). The BLUP or BLUE
of $F\beta$ is $g_3^*$ and the BLUP of $a$ is $\Lambda(g_2 - g_3^*)$. It may be
of interest to note that the structure of the BLUP does not
change regardless of whether or not $\beta$ is known. However,
its MSE does change as expected due to estimation of $\beta$.

When $V$ and $W$ are replaced by estimates, the estimator
$g_4$ is termed EBLUP. Note that the model (2.4) is more
realistic than (2.3), and therefore, the performance of $g_4$
is expected to be quite favourable. The estimator $g_4$
approaches $g_2$ when the $v_k$'s get small, i.e., when the $n_{hk}$'s
become large. However, it remains biased, in general,
conditional on $\theta$, with bias tending to 0 as the $v_k$'s get
small.
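A minimal Python sketch of the composite form in (2.6) follows, assuming diagonal and known $V$ and $W$ and a single auxiliary variable. The function name and the input values are hypothetical; the sketch is meant to show the shrinkage weights $w_k/(v_k + w_k)$ and the two limiting cases, not to reproduce the estimation of variance components.

```python
def fay_herriot_blup(g2, x, v, w):
    """BLUP of (2.6) with diagonal, known V = diag(v) and W = diag(w).

    Returns (g4, lam, g3_star): the composite estimates, the weights
    lam_k = w_k / (v_k + w_k) on the direct estimates, and the synthetic
    part g3*_k = beta1* + beta2* x_k fitted by WLS with weights 1/(v_k + w_k).
    """
    u = [vk + wk for vk, wk in zip(v, w)]
    wt = [1.0 / uk for uk in u]
    sw = sum(wt)
    sx = sum(a * xk for a, xk in zip(wt, x))
    sy = sum(a * yk for a, yk in zip(wt, g2))
    sxx = sum(a * xk * xk for a, xk in zip(wt, x))
    sxy = sum(a * xk * yk for a, xk, yk in zip(wt, x, g2))
    beta2 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    beta1 = (sy - beta2 * sx) / sw
    g3_star = [beta1 + beta2 * xk for xk in x]
    lam = [wk / uk for wk, uk in zip(w, u)]
    g4 = [l * y + (1.0 - l) * s for l, y, s in zip(lam, g2, g3_star)]
    return g4, lam, g3_star


# Hypothetical inputs for four small areas.
x_fh = [1.0, 2.0, 3.0, 4.0]
g2_fh = [5.0, 9.0, 10.0, 15.0]
v_fh = [1.0, 1.0, 1.0, 1.0]

g4_mid, lam_mid, g3s_mid = fay_herriot_blup(g2_fh, x_fh, v_fh, [1.0] * 4)
g4_syn, _, g3s_syn = fay_herriot_blup(g2_fh, x_fh, v_fh, [0.0] * 4)
g4_dir, _, _ = fay_herriot_blup(g2_fh, x_fh, v_fh, [1e12] * 4)
```

With zero area-effect variance the composite collapses to the synthetic part, and with a very large area-effect variance it approaches the direct estimate, which is the behaviour the text attributes to $g_4$.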


2.5 Method 5 (Sample Size Dependent Estimator) 


An alternative composite estimator is given by the
sample size dependent estimator of Drew, Singh and
Choudhry (1982). It is defined as

$$g_5 = \Delta g_2 + (I - \Delta) g_3,$$

where $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_K)$,

$$\delta_k = \begin{cases} 1 & \text{if } \sum_{j \in s_k} d_j \geq \lambda N_k, \\ \sum_{j \in s_k} d_j / (\lambda N_k) & \text{otherwise,} \end{cases} \tag{2.7}$$

$N_k$ is the population size of small area $k$,
and the parameter $\lambda$ is chosen subjectively as a way of
controlling the contribution of the synthetic component.
The above estimator takes account of the realized sample
sizes $n_{hk}$ and if these are deemed to be sufficiently large
according to the condition in (2.7), then it does not rely
on the synthetic estimator. This property is somewhat
similar to that of $g_4$; however, unlike $g_4$, the above estimator
does not take account of the relative sizes of the
within area and between area variation. Rao and Choudhry
(1993) have demonstrated empirically how EBLUPs can
sometimes outperform sample size dependent estimators,
especially when the between area variation is not large
relative to the within area variation. Sarndal and Hidiroglou
(1989) also proposed estimators similar to the above
sample size dependent estimator.
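The weighting rule of the sample size dependent estimator can be illustrated with a short Python sketch. It assumes the rule as read here: full weight on the direct estimate when the sum of survey weights in the area reaches $\lambda$ times the area population size, and a proportionally reduced weight otherwise. The function names and numbers are hypothetical.

```python
def ssd_weight(sum_weights, N_k, lam=1.0):
    """Weight delta_k on the direct estimate for one small area."""
    ratio = sum_weights / (lam * N_k)
    return 1.0 if ratio >= 1.0 else ratio


def ssd_estimate(g2_k, g3_k, sum_weights, N_k, lam=1.0):
    """Sample size dependent composite of direct (g2_k) and synthetic (g3_k)."""
    delta = ssd_weight(sum_weights, N_k, lam)
    return delta * g2_k + (1.0 - delta) * g3_k


# Hypothetical area with population size 100.
d_large = ssd_weight(120.0, 100.0)            # realized sample large enough
d_small = ssd_weight(50.0, 100.0)             # only half the expected weight
d_relaxed = ssd_weight(50.0, 100.0, lam=0.5)  # smaller lambda trusts the sample more
g5_demo = ssd_estimate(10.0, 20.0, 50.0, 100.0)
```

A smaller $\lambda$ lowers the threshold at which the synthetic component drops out, which is the sense in which $\lambda$ controls its contribution.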
3. METHODS BASED ON POOLED
CROSS-SECTIONAL 
AND TIME SERIES DATA 


Suppose information is available for several time points,
$t = 1, \ldots, T$, in the form of direct small area estimators
$g_{2t}$, where $g_{2t}$ is the vector of estimates $g_{2k}$ in (2.2) based
on data from time $t$, and also the small area population
totals for the auxiliary variable. We will now introduce
some estimators which generalize the Fay-Herriot estimator
$g_{4T}$ in different ways by taking account of the serial
dependence of the direct estimates $\{g_{2t}: t = 1, \ldots, T\}$.
Recall that for the Fay-Herriot estimator, the model for
$\theta_T$ has two components, namely, the structural component
$F_T\beta_T$ and the area component $a_T$. The estimator $g_{4T}$
borrows strength over areas for the current time $T$ and is
given by the sum of two components, each being EBLUP
(BLUE) for the corresponding random (fixed) effect, i.e.,

$$g_{4T} = F_T\hat\beta_T + \hat a_T. \tag{3.1}$$

Methods based on time series data could, however, borrow
strength over time as well. Here we introduce three estimators
which are motivated from specific structural
models for serial dependence. All three of these estimators
are optimal under different special cases of a structural
time series model for the direct small area estimates
$\{g_{2t}: t = 1, \ldots, T\}$ specified by the following state space
model. Let $\alpha_t$ denote $(\beta_t', a_t')'$ and $H_t$ denote $(F_t, I)$.
Then we have

$$g_{2t} = \theta_t + \varepsilon_t, \qquad \theta_t = F_t\beta_t + a_t = H_t\alpha_t \tag{3.2a}$$

and

$$\alpha_t = G_t\alpha_{t-1} + \xi_t, \tag{3.2b}$$

where

$$G_t = \begin{pmatrix} G_t^{(1)} & 0 \\ 0 & G_t^{(2)} \end{pmatrix}, \qquad \xi_t = \begin{pmatrix} \xi_t^{(1)} \\ \xi_t^{(2)} \end{pmatrix}, \tag{3.2c}$$

along with the usual assumptions about random errors,
i.e., $\varepsilon_t$, $\xi_t$ are uncorrelated, $\xi_t$ is uncorrelated with $\alpha_s$
for $s < t$, and that $\varepsilon_t \sim (0, V_t)$, $\xi_t \sim (0, \Gamma_t)$ where
$\Gamma_t = \text{block diag}\{B_t, \Omega_t\}$. The covariance matrices $V_t$, $B_t$,
and $\Omega_t$ are generally diagonal. If $G_t^{(1)} = I$ and $G_t^{(2)} = I$
then $\beta_t$ and $a_t$ evolve according to a random walk.

This model is in the general class defined by Pfeffermann 
and Burck (1990) using structural time series models. The
main purpose of their study was to show how accounting 
for cross-sectional correlations between neighbouring 
small areas (in addition to serial correlations) and inclusion 
of certain robustness modifications (to protect against
model breakdowns) could improve the performance of
time series model based estimators. They also used the 
maximum likelihood method under normality to estimate 
model parameters. The focus of this paper, on the other 
hand, is on the Monte Carlo evaluation of a special class 
of time series estimators (related to Fay-Herriot) chosen 
on the basis of heuristic considerations and not on the basis 
of model fitting. The methods considered could, therefore, 
be viewed as model assisted methods whose performance 
will be evaluated in a design based (i.e., repeated sampling)
framework by Monte Carlo simulation. Moreover, it will 
be seen later that, for the types of serial dependence con- 
sidered, the model parameters can be estimated relatively 
simply by the method of moments, without making any 
distributional assumptions such as normality. 

To find the optimal estimator (BLUP) of $\theta_T$ in (3.2)
based on all the direct estimates up to time $T$, we first
find the BLUP $\tilde\alpha_T$ of $\alpha_T$ from which the BLUP of $\theta_T$
is obtained as $H_T\tilde\alpha_T$. It is possible, albeit cumbersome,
to get $\tilde\alpha_T$ directly from the complete data using the theory
of linear models with random effects. However, since the
$\alpha_t$'s are connected over time according to the transition
equation (3.2b), it is more convenient to compute it recursively
using the Kalman filter (KF). Traditionally KF is
viewed as a Bayesian technique in which at each time $t$,
the posterior distribution of $\alpha_t$ given data up to $t - 1$ is
updated to get the posterior distribution of $\alpha_t$ given data
up to time $t$. Although it is instructive to view KF in this
manner, it is not necessary under mixed linear models.
Suppose $\tilde\alpha_{T|s}$ denotes the BLUP of $\alpha_T$ based on data up
to time $s$, $s < T$. It is known (see Duncan and Horn 1972)
that, for the special structure of serial dependence considered
here, the BLUP $\tilde\alpha_T$ of $\alpha_T$ based on data up to time
$T$ is the same as the BLUP of $\alpha_T$ based on $\tilde\alpha_{T|s}$ and the
last $T - s$ observations. In other words, information in
the previous data can be condensed into an appropriate
BLUP before augmenting more current data points. A
good description of the Kalman filter is given in chapter 3
of Harvey (1989).
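The recursive idea can be illustrated with a deliberately simplified Python sketch: a scalar random walk state observed with error, which is a special case of (3.2) with a single state and $H_t = 1$, not the multivariate filter used in the study. The function name and inputs are hypothetical. Time points with no direct estimate are passed as `None` and only the prediction step is applied to them.

```python
def kalman_local_level(y, v, q, a0, p0):
    """Scalar Kalman filter for alpha_t = alpha_{t-1} + xi_t, y_t = alpha_t + eps_t.

    y: observations, with None marking time points that have no direct estimate.
    v: survey error variances Var(eps_t); q: Var(xi_t).
    a0, p0: initial state estimate and its MSE.
    Returns a list of (filtered estimate, MSE) pairs, one per time point.
    """
    a, p = a0, p0
    path = []
    for y_t, v_t in zip(y, v):
        # Prediction step: carry the estimate forward, inflate its MSE.
        a_pred, p_pred = a, p + q
        if y_t is None:
            a, p = a_pred, p_pred
        else:
            # Update step: shrink the prediction towards the new observation.
            gain = p_pred / (p_pred + v_t)
            a = a_pred + gain * (y_t - a_pred)
            p = (1.0 - gain) * p_pred
        path.append((a, p))
    return path


# Hypothetical series: state known exactly at the start, two missing points.
kf_path = kalman_local_level([None, None, 10.0], [1.0, 1.0, 1.0],
                             q=1.0, a0=0.0, p0=0.0)
```

Each step uses only the previous estimate and its MSE, which is the condensation property described above; the MSE simply accumulates across a gap until the next observation arrives.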


3.1 Method 6 (Time Series EBLUP-I) 


For the first estimator, we let $\beta_t$ evolve over time
(e.g., according to a random walk), but assume that $a_t$ is
serially independent. The equations for the state space
model for this case are similar to (3.2) except that the serial
independence of the $a_t$'s implies $G_t^{(2)} = 0$. This will give
rise to a composite estimator

$$g_{6T} = F_T\tilde\beta_T + \tilde a_T. \tag{3.3}$$

Note that $\tilde\beta_T$ in (3.3) would now be based on all the small
area estimates up to time $T$ and therefore would be different
from $\hat\beta_T$ of (3.1) which is based on only direct estimates
at time $T$. The estimator $\tilde a_T$, as a result, would also be
different from the corresponding component $\hat a_T$ of (3.1).


Survey Methodology, June 1994 


In the simulation study described later we take $G_t^{(1)} = I$,
$B_t = \mathrm{diag}(\gamma_1^2, \gamma_2^2)$, corresponding to a random walk model,
and $\Omega_t = \tau^2 I$. Appendix A.2 illustrates the method of
moments estimation of the parameters $\gamma_1^2$, $\gamma_2^2$, and $\tau^2$.
The KF may then be run, with initial values for $\alpha_t$ and its
MSE obtained from the FH estimator at $t = 1$, to obtain
the EBLUP of $\alpha_T$. Then $H_T\tilde\alpha_T$ is the time series EBLUP-I
estimator $g_{6T}$ at time $T$.

As pointed out by a referee, when the number of small
areas is quite large, or when the variation in $\beta_t$ over $t$ is
relatively large, there is little difference between $g_{6T}$ and
$g_{4T}$. Indeed, there is little difference between the performances
of these two estimators in our simulation study
described in Section 4.


3.2 Method 7 (Time Series EBLUP-II)


For the second estimator, we let $\beta_t$ be fixed (it may or
may not be common for different time points) and let the
area effects $a_t$ be serially dependent according to, for
example, a random walk. This time series generalization
could be viewed as an analogue of the model proposed by
Rao and Yu (1992). The resulting composite estimator will
have the same form as (3.1), i.e.,

$$g_{7T} = F_T\tilde\beta_T + \tilde a_T, \tag{3.4}$$

but the component estimates $\tilde\beta_T$ and $\tilde a_T$ would be different.
We have two cases.


3.2.1 Case 1: Suppose the $\beta_t$'s are fixed and time-invariant
but the $a_t$'s are serially dependent. Then, in
(3.2), $G_t^{(1)} = I$ and $B_t = 0$. If $\Omega_t$ is taken as $\tau^2 I$, then the
only unknown parameter $\tau^2$ can be estimated by the
method of moments; see Appendix A.2. We will denote
by $g_{7T}^*$ the EBLUP obtained in this case when the parameter
estimate is substituted.


3.2.2 Case 2: Here we assume that the $\beta_t$'s are fixed but
different for different time points. The area effects $a_t$
evolve over time as in Case 1. In (3.2) we have $G_t^{(1)} = 0$
and $B_t = mI$ where $m$ is a large number. The expressions
for $\tilde\alpha_t$ and its MSE obtained from the KF in this case give
the correct formulas as $m \to \infty$ (see Sallas and Harville
1981). The KF updating equations for $\tilde\alpha_t$ in this case take
the special form

$$\tilde\beta_t = (F_t' A_t^{-1} F_t)^{-1} F_t' A_t^{-1} (g_{2t} - G_t^{(2)}\tilde a_{t-1}),$$

$$\tilde a_t = G_t^{(2)}\tilde a_{t-1} + P_{t|t-1} A_t^{-1} (g_{2t} - G_t^{(2)}\tilde a_{t-1} - F_t\tilde\beta_t),$$

$$P_t = P_{t|t-1} - P_{t|t-1} \big( A_t^{-1} - A_t^{-1} F_t (F_t' A_t^{-1} F_t)^{-1} F_t' A_t^{-1} \big) P_{t|t-1},$$

where $A_t = P_{t|t-1} + V_t$, $P_t$ is the MSE of $\tilde a_t$ about $a_t$,
and $P_{t|t-1} = G_t^{(2)} P_{t-1} \{G_t^{(2)}\}' + \Omega_t$ is the MSE of
$G_t^{(2)}\tilde a_{t-1}$ as an estimator of $a_t$. The time series EBLUP in this
case will be denoted by $g_{7T}$.
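The MSE recursion for Case 2 can be checked with a short derivation. This is a sketch under the model assumptions of (3.2), writing $P = P_{t|t-1}$, $A = P + V_t$, $F = F_t$, $e_t = g_{2t} - G_t^{(2)}\tilde a_{t-1}$ for the innovation, and introducing the shorthand $M$ (not used in the paper) for the matrix in parentheses in the $P_t$ update.

$$
\begin{aligned}
M &= A^{-1} - A^{-1}F(F'A^{-1}F)^{-1}F'A^{-1}, \qquad MF = 0, \qquad MAM = M, \\
e_t - F\tilde\beta_t &= \{I - F(F'A^{-1}F)^{-1}F'A^{-1}\}\,e_t = A M e_t, \\
\tilde a_t &= G_t^{(2)}\tilde a_{t-1} + P A^{-1}(A M e_t) = G_t^{(2)}\tilde a_{t-1} + P M e_t, \\
e_t &= F\beta_t + (a_t - G_t^{(2)}\tilde a_{t-1}) + \varepsilon_t
\;\Rightarrow\;
a_t - \tilde a_t = (I - PM)(a_t - G_t^{(2)}\tilde a_{t-1}) - PM\varepsilon_t, \\
P_t &= (I - PM)P(I - MP) + PMV_tMP = P - 2PMP + PM(P + V_t)MP = P - PMP.
\end{aligned}
$$

The fourth line uses $MF = 0$, so the unknown $\beta_t$ drops out of the error, and the last line uses $MAM = M$ together with the assumption that $\varepsilon_t$ is uncorrelated with the prediction error $a_t - G_t^{(2)}\tilde a_{t-1}$.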
3.3 Method 8 (Time Series EBLUP-III)


For the third estimator, we let both $\beta_t$ and $a_t$ evolve
over time. This will have more complex serial dependence
than either (3.3) or (3.4). Its form will be similar to (3.1)
and can be represented as

$$g_{8T} = F_T\tilde\beta_T + \tilde a_T. \tag{3.5}$$

As before, if $B_t = \mathrm{diag}\{\gamma_1^2, \gamma_2^2\}$ and $\Omega_t = \tau^2 I$, then the
model parameters $\tau^2$, $\gamma_1^2$, $\gamma_2^2$ can be estimated by the
method of moments as in Appendix A.2. The resulting
EBLUP of $\theta_T$ will be denoted by $g_{8T}$.

It may be of interest to note that many of the estimators
considered so far are optimal under special cases of the
model underlying $g_{8T}$. As has been shown, the time series
EBLUPs of methods 6 and 7 result from making restrictions
on the matrices $G_t$ and $\Gamma_t$. The cross-sectional Fay-Herriot
estimators of Section 2.4 result from restricting
the data to a single time point. The synthetic estimators
of Section 2.3 are special cases of the Fay-Herriot estimators
with zero variance for the random small area
effects, and the direct (post-stratified) estimator is obtained
in the limit as the variance of the small area effects goes
to infinity.

A further generalization that could be useful is to allow
correlations between neighbouring small area effects. This
can be accomplished by allowing the matrix $\Omega_t$ in (3.2) to
be non-diagonal; however, it is not clear what would be
an appropriate correlation structure in $\Omega_t$.


4. MONTE CARLO STUDY 


The cross-sectional and time series methods were com- 
pared empirically by means of a Monte Carlo simulation 
from a real time series obtained from Statistics Canada’s 
biannual farm surveys, namely, the National Farm Survey 
(in June) and the January Farm Survey. Due to the redesign 
after the census of Agriculture in 1986, the survey data for 
the six time points starting with the summer of 1988 were 
employed to create a pseudo-population for simulation 
purposes. To this, data from the census year 1986 was also 
added. Thus information at one more time point was 
available although this resulted in a 3-point gap in the 
series. The missing data points, however, can be easily 
handled by time series methods. It may be noted that 
although the data series is short, it is nevertheless believed 
to be adequate for illustrative purposes. The parameter of 
interest was taken as the total number of cattle and calves 
for each crop district (defined as the small area) at each 
time point. For simplicity, independent stratified random 
samples were drawn for each occasion from the pseudo- 
population, though the farm surveys use rotating panels 
over time. The dependence of direct small area estimates 
over time was modelled by assuming that the underlying
small area population totals are connected according to
some random process. The auxiliary variable used in the 
model was the ratio-adjusted census 1986 value of the total 
cattle and calves for each small area. This showed high 
correlations with the corresponding variable over time at 
the farm level. Specific details of the empirical study are 
described below. 


4.1 Design of the Simulation Experiment 


First we need to construct a pseudo-population from 
the survey data over six time points (June 1988, January 
1989, ..., January 1991). The actual design involves two 
frames (list and area) with a one stage stratified sampling 
from the list frame and a two stage stratified sampling 
from the area frame, for details see Julien and Maranda 
(1990). We decided to use survey data from the list frame 
only because the list frame corresponds to farms existing 
at the time of Census 1986 and the chosen auxiliary variable 
for model building was based on Census 1986 information. 
Moreover, we chose to use the data from the province of
Quebec because its area sample is only a minor component
of the total sample and the estimated coefficients of variation
for the twelve crop districts (i.e., small areas of interest)
of this province showed a wide range for the livestock
variables. It was decided to avoid variability due to changes
in the underlying population over time by retaining only
those farms which responded on all six occasions. Also,
farm units that belonged to a multiholding arrangement
at any one of the seven time points (including the census)
were excluded because of the problems in finding an individual
farm’s data from the multiholding summary record
and changes in their reporting arrangement over time.

The various exclusions described above were motivated
by the aim of obtaining a sharper comparison
between small area estimators. The total count of farm
units after exclusions was found to be 1,160 out of a total 
of over 40,000 farms on the list frame. For the pseudo- 
population, we replicated the 1,160 farm units propor- 
tional to their sampling weight so that the total size N of 
the pseudo-population was 10,362, which was manageable 
for micro-computer simulation. 

The pseudo-population was stratified into four take- 
some and one take-all strata using Census 1986 count data 
on cattle and calves as the stratification variable. Although 
we did not consider alternative stratifications or sample 
sizes in our simulation study, there is no reason to think 
that our conclusions would alter significantly if we were 
to do so. The sigma-gap rule (Julien and Maranda 1990) 
was used for defining the take-all stratum. To apply the 
sigma-gap rule we look at the smallest population value 
greater than the population median where the distance to 
the next population value, in order of size, is at least one 
population standard deviation; all units above this point 
are placed into the take-all stratum. The algorithm of Sethi 


(1963) was used for determining optimal stratification 
boundaries for take-some strata. Neyman’s optimum 
allocation was used for sample sizes for strata in order to 
optimize the precision of the provincial estimate of total 
count. This resulted in, from a total sample size of 207
(2% sampling rate), allocations of 51, 62, 48 and 35 from
take-some strata with 5,001, 3,188, 1,850 and 312 farms,
respectively, and the size of the take-all stratum was 11.
The expected number of sample farms in each small area 
varied from 4.6 in area 9 up to 27.5 in area 6, with an 
average of 17.3. The expected number of sample farms 
with some cattle and calves varied from 3.6 in area 9 to 
18.8 in area 3, and the average over the small areas was 
11.7. A total of 30,000 simulations were performed. For 
each simulation, samples were drawn independently for 
each time point using stratified simple random sampling 
without replacement. The 30,000 simulations were con- 
ducted in 15,000 sets of 2 simulations where each set corre- 
sponds to a different vector of realized sample sizes in the 
twelve small areas within each stratum. This was required 
to compute certain conditional evaluation measures as 
described in the next subsection, see also Sarndal and 
Hidiroglou (1989). 
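The sigma-gap rule and Neyman allocation described above can be sketched in Python as follows. This is an illustration based only on the verbal description in this section: the function names are hypothetical, the divisor-$N$ standard deviation is an assumption, the allocation is left unrounded, and the toy values are not the farm data.

```python
import statistics


def sigma_gap_cutoff(values):
    """Return the smallest value above the median whose gap to the next
    larger value is at least one population standard deviation, or None
    if there is no such gap. Units above the cutoff form the take-all stratum."""
    xs = sorted(values)
    median = statistics.median(xs)
    sd = statistics.pstdev(xs)  # assumption: divisor-N standard deviation
    for low, high in zip(xs, xs[1:]):
        if low > median and high - low >= sd:
            return low
    return None


def neyman_allocation(n, sizes, sds):
    """Allocate a total sample of n to strata proportionally to N_h * S_h."""
    products = [N * S for N, S in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]


# Hypothetical skewed population with one very large unit.
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
cutoff = sigma_gap_cutoff(population)
take_all = [x for x in population if x > cutoff]

# Hypothetical take-some strata: equal sizes, unequal variability.
alloc = neyman_allocation(40, [100, 100], [1.0, 3.0])
```

In practice the allocations would be rounded and capped at the stratum sizes; those steps are omitted here.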


4.2 Evaluation Measures 


Suppose $m$ simulations are performed in which $m_1$ sets
of different vectors of realized sample sizes in domains
$(h,k)$ are replicated $m_2$ times. The following measures
can be used for comparing performance of different estimators
at time $T$. Let $i$ vary from 1 to $m_1$ and $j$ from 1
to $m_2$.


(i) Absolute Relative Bias for area k:

$$\mathrm{ARB}_k = \Big| m^{-1} \sum_i \sum_j (\mathrm{est}_{ijk} - \mathrm{true}_k)/\mathrm{true}_k \Big|. \tag{4.1}$$


The average of ARB, over areas k will be denoted by 
AARB. We take the absolute relative bias since our 
primary interest in this study is in an overall measure 
like AARB; however, in other contexts the actual 
biases for individual small areas may also be of con- 
siderable interest. 


The following measure is motivated by a desire to evaluate
the conditional performance of estimators, conditional
on the vectors of realized sample sizes in domains.
It is conventional to measure performance conditional on
fixed domain sample sizes; here we consider the standard
deviation of the conditional bias, $B_{ik}$, as a simple summary
measure. If this standard deviation is small then the
method is robust to variations in the realized sample sizes.
Note that the expected value of $B_{ik}$ is just the unconditional
bias which is estimated by $\mathrm{ARB}_k$. Let $B_k^2$ denote the
unconditional expected value of $B_{ik}^2$. We define the
following Monte Carlo measure:

(ii) Standard Deviation of Conditional Relative Bias for
area k:

$$\mathrm{SDCRB}_k = \Big\{ m_1^{-1} \sum_i (\hat B_{ik}^2 - C_{ik})/\mathrm{true}_k^2 - \mathrm{ARB}_k^2 \Big\}^{1/2},$$

$$\hat B_{ik} = m_2^{-1} \sum_j \mathrm{est}_{ijk} - \mathrm{true}_k, \tag{4.2}$$

$$C_{ik} = m_2^{-1}(m_2 - 1)^{-1} \Big( \sum_j \mathrm{est}_{ijk}^2 - \big(\sum_j \mathrm{est}_{ijk}\big)^2 / m_2 \Big).$$

The correction term $C_{ik}$ adjusts for bias in $\hat B_{ik}^2$, as an
estimate of $B_{ik}^2$, due to $m_2$ being finite. $\hat B_{ik}^2 - C_{ik}$ is
conditionally unbiased for $B_{ik}^2$; it is also unconditionally
unbiased for $B_k^2$. The Monte Carlo average
$m_1^{-1} \sum_i (\hat B_{ik}^2 - C_{ik})$ converges to $B_k^2$ with probability 1
as $m_1 \to \infty$. $\hat B_{ik}^2 - C_{ik}$ may be negative for some $i$,
due to finite $m_2$. For large $m_1$ the average over $i$ is
usually very close to $B_k^2$; whenever the average, relative to
$\mathrm{true}_k^2$, is less than $\mathrm{ARB}_k^2$ we set $\mathrm{SDCRB}_k$ to 0. ASDCRB will denote
the average of $\mathrm{SDCRB}_k$ over areas $k$.


(iii) Mean Absolute Relative Error for area k:

$$\mathrm{MARE}_k = m^{-1} \sum_i \sum_j |\mathrm{est}_{ijk} - \mathrm{true}_k| / \mathrm{true}_k \tag{4.3}$$

and AMARE denotes the average of $\mathrm{MARE}_k$ over
areas.


(iv) Mean Squared Error for area k:

$$\mathrm{MSE}_k = m^{-1} \sum_i \sum_j (\mathrm{est}_{ijk} - \mathrm{true}_k)^2 \tag{4.4}$$

and AMSE as before denotes the average over areas.

(v) Relative Root Mean Squared Error for area k:

$$\mathrm{RRMSE}_k = \{\mathrm{MSE}_k\}^{1/2}/\mathrm{true}_k. \tag{4.5}$$

Again, ARRMSE denotes the average over areas.
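For concreteness, the five measures can be computed for one small area with the Python sketch below. It assumes one reading of measure (ii), in which the bias-corrected squared conditional biases are averaged over configurations, scaled by the squared true value, and reduced by the squared ARB; the function name, the nested-list layout of `est` and the toy numbers are hypothetical.

```python
import math


def evaluation_measures(est, true):
    """Monte Carlo evaluation measures for one small area.

    est[i][j]: estimate from replicate j of sample size configuration i
    (m1 configurations, m2 >= 2 replicates each); true: the true area total.
    """
    m1, m2 = len(est), len(est[0])
    m = m1 * m2
    flat = [e for row in est for e in row]
    arb = abs(sum((e - true) / true for e in flat) / m)
    mare = sum(abs(e - true) for e in flat) / (m * true)
    mse = sum((e - true) ** 2 for e in flat) / m
    rrmse = math.sqrt(mse) / true
    corrected = 0.0
    for row in est:
        s = sum(row)
        b_hat = s / m2 - true  # estimated conditional bias
        # Estimated variance of the configuration mean (correction term).
        c = (sum(e * e for e in row) - s * s / m2) / (m2 * (m2 - 1))
        corrected += b_hat ** 2 - c
    rel_var = corrected / (m1 * true ** 2) - arb ** 2
    sdcrb = math.sqrt(rel_var) if rel_var > 0 else 0.0  # truncate at zero
    return {"ARB": arb, "SDCRB": sdcrb, "MARE": mare, "MSE": mse, "RRMSE": rrmse}


# Hypothetical estimates: two configurations, two replicates each.
measures = evaluation_measures([[90.0, 110.0], [120.0, 140.0]], 100.0)
```

Averaging each entry over the small areas would give the summary quantities AARB, ASDCRB, AMARE, AMSE and ARRMSE.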


The precision (i.e., the Monte Carlo standard error)
of each measure depends on $m_1$, $m_2$. For all measures
except (ii), the optimal choice of $m_1$, $m_2$ under the restriction
that $m_2 > 1$ is $m_1 = m/2$, $m_2 = 2$, since this minimizes
the Monte Carlo standard error. To see this, let $A$
be the average of an evaluation measure from $m_2$ samples
all with the same sample configuration (set of random
sample sizes in domains) which we call $C$. Then the
expected value of $A$ conditional on $C$ is a function of $C$,
say $E(C)$, and the conditional variance of $A$ is proportional
to $m_2^{-1}$, say $V(C)/m_2$. The unconditional variance
of $A$ is then $V\{E(C)\} + E\{V(C)\}/m_2$, and the overall
Monte Carlo variance of an evaluation measure based
on $m_1$ sample configurations replicated $m_2$ times is
$V\{E(C)\}/m_1 + E\{V(C)\}/(m_1 m_2)$ which is minimized,
since $m = m_1 m_2$ is fixed, by taking $m_1$ as large as possible.
For the second measure, the appropriate choice of $m_1$, $m_2$
is less straightforward. In the simulation study, $m$ was
chosen as 30,000 and the corresponding values of $m_1$, $m_2$
were set at 15,000 and 2.
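The variance argument can be checked numerically with a small Python sketch. The function name and the unit variance components are assumptions chosen only to show the direction of the effect.

```python
def mc_variance(var_of_cond_mean, mean_of_cond_var, m, m2):
    """Monte Carlo variance V{E(C)}/m1 + E{V(C)}/(m1*m2) with m1 = m // m2."""
    m1 = m // m2
    return var_of_cond_mean / m1 + mean_of_cond_var / (m1 * m2)


# Hypothetical variance components, both set to 1, with m = 30,000 in total.
var_two_reps = mc_variance(1.0, 1.0, 30000, 2)   # 15,000 configurations
var_ten_reps = mc_variance(1.0, 1.0, 30000, 10)  # 3,000 configurations
```

The second term equals $E\{V(C)\}/m$ whatever the split, so only the first term changes, and it is smallest when the number of configurations is as large as the restriction $m_2 > 1$ allows.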


4.3 Estimators Used in the Comparative Study 


There were nine estimators included in the study,
namely, $g_1$ to $g_8$ and $g_7^*$, all calculated for time $T = 10$.
We used a simple linear regression model for the synthetic
component with the auxiliary variable defined as

$$X_{kt} = (\theta_{k1}/\theta_1)\,\hat\theta_t, \tag{4.6}$$

where $\theta_{k1}$, $\theta_1$ respectively denote the population totals
for small area $k$ and the province at $t = 1$, i.e., at Census
1986. The estimator $\hat\theta_t$ denotes the post-stratified estimator
of $\theta_t$ from the farm survey at time $t$ at the province level.
Thus $X_{kt}$ is simply a ratio-adjusted synthetic variable.
The variances of error components in the regression model
were assumed to be constant over areas. For time series
models, it was assumed that the serial dependence was
generated by a random walk. The above type of model
assumptions have been successfully used in many applications
and the main reason for our choice was simplicity.
It was hoped, however, that the chosen models might be
adequate for our purpose and might illustrate the differential
gains with different types of model assisted small area
estimators, i.e., both cross-sectional and time series
smoothing methods.
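As a small illustration of the auxiliary variable in (4.6), the Python sketch below scales each area's census share to the current provincial estimate. The function name, the dictionary layout and the numbers are hypothetical.

```python
def ratio_adjusted_x(census_totals, provincial_estimate):
    """Return X_kt = (theta_k1 / theta_1) * theta_hat_t for every small area k.

    census_totals: dict of census (t = 1) totals by small area.
    provincial_estimate: post-stratified provincial estimate at time t.
    """
    census_province = sum(census_totals.values())
    return {area: total / census_province * provincial_estimate
            for area, total in census_totals.items()}


# Hypothetical census totals for two areas and a current provincial estimate.
x_t = ratio_adjusted_x({"a": 30.0, "b": 70.0}, 200.0)
```

By construction the values add up to the provincial estimate, and at the census time point they coincide with the census totals themselves.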

Since the Census 1986 data was included in the time
series, the direct estimate $g_{21}$ corresponds to Census 1986
and therefore the survey error $\varepsilon_1$ would be identically 0.
Moreover, from the definition of $X_{k1}$, it follows that a
reasonable choice of $(\beta_{11}, \beta_{21})$ would be (0,1) which
implies that $a_1$ must be 0. Thus the covariance matrices
$B_t$ and $W_t$ at $t = 1$ are null and, therefore, the distribution
of $\alpha_t$ at $t = 1$ would not require estimation. The
above modification in the initial distribution of $\alpha_t$ is
natural in view of the extra information available from the
census. Moreover, since the direct estimates $g_{2t}$ were not
available for $t = 2, 3, 4$, equations for estimating model
variance components in Appendix A.2 were modified
accordingly.

For method 7 (case 1), $\beta_t$ was assumed to have a
common fixed value only for $t \geq 2$ because at $t = 1$,
$\beta_1 = (0,1)'$. For the sample size dependent estimator $g_5$
the parameter $\lambda$ was taken to be 1.


40 Singh, Mantel and Thomas: Time Series EBLUPs for Small Areas 


4.4 Empirical Results 


The main findings were listed in Section 1. Here we give
some detailed comparisons and some possible explanations.
We do not show separate results for $g_7^*$ which
performs slightly worse than, though overall similarly to,
$g_7$. The estimators are summarized in Table 1. Figures 1
to 3 and Tables 2 to 4 present some of the empirical results.
We have not shown the Monte Carlo standard errors but
they were all found to be quite negligible.


Table 1

Summary of Estimators

g1 - Expansion
g2 - Post-stratified
g3 - Synthetic
g4 - Fay-Herriot
g5 - Sample Size Dependent
g6 - Time Series EBLUP-I, β's evolve over time, a's independent over time
g7 - Time Series EBLUP-II, a's evolve over time, fixed β's
g8 - Time Series EBLUP-III, β's and a's evolve over time


Table 2 gives the five evaluation measures averaged
over small areas, and Figure 1 shows plots of the averaged
evaluation measures relative to the Fay-Herriot (g4)
value. There is a clear pattern in the behaviour of various
measures across different estimators. The direct estimator
g2 does very well with respect to the bias measure (AARB)
but does somewhat poorly with respect to the other
measures. The cross-sectional smoothing method g3
(synthetic) does quite poorly with respect to the bias
measures. The Fay-Herriot method g4 performs somewhat
better than post-stratified on average with respect to the
MSE measure but is much worse in terms of bias. The
sample size dependent method g5 is quite similar to g2,
slightly worse with respect to the bias measures and slightly
better with respect to the other measures. The time series
methods g7 and g8 perform quite well overall, though
they are somewhat worse than g2 with regard to bias. The
performance of the time series estimator g6 is generally
between that of Fay-Herriot and the time series estimators
g7 and g8. For all of the estimators (including the synthetic
g3) the standard deviation of the conditional relative bias
(ASDCRB) is appreciable; however, it is smallest for the
time series methods. As expected, the expansion estimator
g1 does well with respect to the unconditional bias measure,
AARB, but its conditional performance (ASDCRB) is
quite poor.


Figure 1. Evaluation Measures Relative to Fay-Herriot
Note: Relative ASDCRB for g1 (= 18.98) not shown.


Table 2 


Average Evaluation Measures 


              g1      g2      g3      g4      g5      g6      g7      g8


AARB 001 O07 = 5-097, 8.0655 pe OLS 070 ge 053. 053 
ASDCRB 282, 0160 PeOlG O15 = 0235 Ol0R LOlON O10 
AMARE 269 SA ee eS: OS 136" O09 BOS 088 
ARRMSE see) NIG ash SIS SINKS IANO) NMED att 


AMSE (1,000's)  72,979  27,596  13,382  12,898  22,760  10,603  8,610  8,829


Figure 2 plots averages of RRMSE for three size
groups, namely small, medium and large small areas,
based on the ranking of their true population totals at time
T. They are divided up into these three groups because the
relative errors of estimation would be expected to be larger
for the smaller totals, and the plots do not contradict this
expectation. Again, the time series methods g7 and g8
perform best. Note that the time series method g6, which
assumes the small area effects to be independent over time,
does not do as well. The unaveraged values of RRMSE
are given in Table 3. RRMSE9 is relatively large because
the total number of cattle and calves for area 9 is less than
half that of any other small area. Areas 6 and 8 stand out
within the medium size small areas as being most difficult
to estimate by the smoothing methods. The reason for this
is that, while there was an overall decline of about 16%
in the total number of cattle and calves in the pseudo-
population from June 1986 to January 1991, the decreases
for areas 6 and 8 were the furthest from the average at 33%
and 1%, respectively, so the ratio adjusted covariate
would be least appropriate for those areas. Nevertheless,
the time series methods g7 and g8 performed significantly
better than the post-stratified estimator for areas 6 and 8.
This is because the random walk model for the small area
effects is able to track small areas which, like areas 6 and 8,
progressively deviate from the model.

Figure 2. Relative Root Mean Squared Errors: Averaged
within Size Groups


Table 3

Relative Root Mean Squared Errors and True Total
Cattle and Calves for Small Areas

Size     Area      True Total |  g1    g2    g3    g4    g5    g6    g7    g8

Small    9            8,502   |  ...   ...   ...   ...   ...   ...   ...   ...
         10          18,990   | .360   ...   ...  .113  .175  .097  .103  .104
         11          18,776   | .339  .122  .122  .103  .112  .096  .086  .087
         12          19,819   | .409  .237  .076  .152  .212  .123  .117  .117
         Average     16,522   | .422  .208  .154  .161  .194  .129  .116  .120

Medium   1           27,595   |  ...   ...   ...   ...   ...   ...   ...   ...
         6           29,012   | .306  .241  .256  .216  .224  .224  .168  .172
         7           23,600   | .341  .121  .107  .094  .110  .088  .092  .092
         8           23,627   |  ...   ...   ...   ...   ...   ...   ...   ...
         Average     25,959   | .336  .205  .159  .151  .185  .147  .126  .127

Large    2           35,592   | .268  .171  .113  .110  .156  .096  .089  .088
         3           40,582   | .241  .151  .087  .090  .137  .070  .072  .073
         4           42,396   | .256  .160  .099  .103  .144  .080  .088  .089
         5           35,996   | .270  .176  .091  .097  .160  .088  .085  .088
         Average     38,642   | .259  .164  .098  .100  .149  .083  .083  .084


Figure 3. Absolute Relative Biases: Averaged within
Size Groups


Figure 3 and Table 4 are identical to Figure 2 and Table 3
in format, but show relative biases instead of relative root
mean squared errors. The biases for both the expansion
estimator g1 and the post-stratified g2 are negligible. For
the smoothing methods the average absolute relative biases
for medium size small areas are relatively large, mainly
because of areas 6 and 8 for which the covariate is least
appropriate. Among smoothing methods, the sample size
dependent g5 has the least bias because it is usually very
close to the direct g2; however, it also gains very little over
g2 with respect to mean squared error. Of the remaining
smoothing methods the time series estimators g7 and g8,
which had the smallest mean squared error, also have the
smallest bias. Nevertheless, the relative bias of these
methods can be quite large, as in areas 6 and 8. In practice
it would not be possible to estimate these biases; however,
the possible size of the bias could be assessed using simu-
lated sampling from a variety of plausible populations.


Table 4

Absolute Relative Biases and True Total Cattle
and Calves for Small Areas

Size     Area      True Total |  g1    g2    g3    g4    g5    g6    g7    g8

Small    9            8,502   | .002  .047  .232  .139  .085  .099  .061  .069
         10          18,990   | .002  .002  .006  .007  .003  .015  .026  .025
         11          18,776   | .002  .009  .090  .052  .021  .062  .039  .037
         12          19,819   | .000  .007  .019  .011  .007  .023  .024  .023
         Average     16,522   | .001  .016  .087  .052  .029  .050  .037  .039

Medium   1           27,595   | .001  .003  .093  .063  .007  .078  .044  .045
         6           29,012   | .000  .001  .239  .157  .023  .195  .120  .123
         7           23,600   | .000  .005  .088  .053  .014  .058  .062  .061
         8           23,627   | .002  .008  .143  .106  .024  .124  .093  .091
         Average     25,959   | .001  .004  .141  .095  .017  .114  .080  .080

Large    2           35,592   | .000  .000  .095  .071  .009  .068  .049  .047
         3           40,582   | .000  .001  .047  .041  .005  .029  .026  .025
         4           42,396   | .001  .002  .066  .056  .008  .044  .057  .056
         5           35,996   | .000  .000  .045  .029  .005  .048  .035  .039
         Average     38,642   | .000  .001  .063  .049  .006  .047  .042  .042


5. CONCLUDING REMARKS 


It was seen by means of a simulation study that small 
area estimation methods obtained by combining both cross- 
sectional and time series data can perform better than those 
based only on cross-sectional data, with respect to both 
bias and mean squared error. However, the cost in terms 
of bias could still be substantial. A question of obvious 
importance is whether it is possible in practical situations 
to judge if the gains from any type of smoothing would 
outweigh the costs, and how to make this judgement. 

The models for the simulation study were chosen on 
general considerations. However, in practice, suitable 
diagnostics similar to those employed in Pfeffermann and 
Barnard (1991) should be developed for survey data before 
any model-assisted method can be recommended. It should 
also be noted that the small area estimators could be 
modified to make them robust to mis-specification of the 
underlying model as suggested by Pfeffermann and Burck
(1990), see also Mantel, Singh and Bureau (1993). Finally, 
modification and further extension of the methods pre- 
sented in this paper to the more realistic case of correlated 
sampling errors should be investigated in the future. 

APPENDIX


A.1 Variance Estimation for $g_{2kt}$


Let $v_{kt}$ denote the conditional (given $n_{hkt}$) variance of
$g_{2kt}$ in (2.2). Then $v_{kt}$ is given by (whenever $n_{hkt} > 0$ for
all $h$ at time $t$),

$$v_{kt} = \sum_{h} N_{hkt}^2 \left( n_{hkt}^{-1} - N_{hkt}^{-1} \right) \sigma_{hkt}^2, \qquad (A.1)$$

where $\sigma_{hkt}^2$ is the population variance for the intersection
of the h-th stratum with the k-th small area at time $t$. The
variance $\sigma_{hkt}^2$ can be estimated by the usual estimator $s_{hkt}^2$
for $n_{hkt} \geq 2$. Note that the estimate of the conditional
variance $v_{kt}$ also provides an estimate of the unconditional
variance of $g_{2kt}$.

If $n_{hkt} = 1$, then we can use a synthetic value as an
estimate of $\sigma_{hkt}^2$, which can be defined as
$\sum (n_{hkt} - 1) s_{hkt}^2 / \sum (n_{hkt} - 1)$, the summation being over all $k$ for
which $n_{hkt} \geq 2$ within each $(h,t)$. If $n_{hkt} = 0$, $v_{kt}$ of
(A.1) is of course not defined. With the synthetic value of
$\bar{y}_{hkt}$ used in this case, we need a synthetic value of its
mean squared error. For each $(h,t)$, it can be defined as


(Xi? iN Es eb as) 8 


where (bias) 2 will be taken as 


»§ ((Xnn/ Xn) Int a Int)? / Mn» 
nnlt>O 


where $m_{ht}$ is the number of small areas with sample in
stratum $h$ at time $t$.


A.2 Estimation of Variance Components 


Using the notation of (3.2), we here illustrate the method 
of moments for estimating variance components for the 
model of Section 3.1 in the special case when there is only 
one auxiliary variable X,,, QO, = 777 and 6, follows a 
random walkie. GX) .= J, bet Fe =", see 
Fa = (1, Xe)’, Br = (B1r,B2r)’, and B, = diag(y{,73). 
The parameter 7” is estimated by the solution of 


ie K 
Ye aa Fe) ee te OC ee 
t=1 k=1 


If there is no positive solution, we set 77 = 0. Here 8, 
denotes the WLS estimate of 6, based on only the cross- 
sectional data at t. This is analogous to the method used 
in Fay and Herriot (1979) for cross-sectional data. An 
estimate of y? can be obtained by solving (for i = 1,2) 


i 
3 (Bit — Bir-1)?/ (> + df?) = T = 1, 
tis? 


where a” is the (i,i)-th element of (F/_, U,<| Fy_,) 7! + 
(HUB) + 

Jackknife Variance Estimation of Imputed Survey Data


JOHN G. KOVAR and EDWARD J. CHEN¹


ABSTRACT 


Imputation is a common technique employed by survey-taking organizations in order to address the problem of
item nonresponse. While in most of the cases the resulting completed data sets provide good estimates of means
and totals, the corresponding variances are often grossly underestimated. A number of methods to remedy this
problem exist, but most of them depend on the sampling design and the imputation method. Recently, Rao (1992),
and Rao and Shao (1992) have proposed a unified jackknife approach to variance estimation of imputed data sets.
The present paper explores this technique empirically, using a real population of businesses, under a simple random
sampling design and a uniform nonresponse mechanism. Extensions to stratified multistage sample designs are
considered, and the performance of the proposed variance estimator under non-uniform response mechanisms is
briefly investigated.


KEY WORDS: Item nonresponse; Hot deck imputation; Nearest neighbour imputation; Nonrandom nonresponse;
Complex survey design.


1. INTRODUCTION 


All sample surveys suffer from varying degrees of 
nonresponse. While total or unit nonresponse is often 
redressed by appropriate survey weight adjustment, most 
survey taking organizations resort to imputation in the 
case of item nonresponse. In this way, plausible values are 
inserted in place of missing or inconsistent entries, thus 
simplifying estimation of means and totals at all levels of 
aggregation. As early as the 1950’s however, Hansen, 
Hurwitz and Madow (1953) recognized that treating the 
imputed values as observed values can lead to under- 
estimation of variances of these estimators if standard 
formulae are used; underestimation which becomes more 
appreciable as the proportion of imputed items increases. 

A number of remedies to overcome this problem have 
been advanced. In particular, Rubin (1987) proposed 
multiple imputation to estimate the variance due to impu- 
tation by replicating the process a number of times and 
estimating the between replicate variation. More recently, 
Sarndal (1990) outlined a number of model assisted esti- 
mators of variance, while Rao and Shao (1992) proposed 
a technique that adjusts the imputed values to correct 
the usual or naive jackknife variance estimator. The 
Sarndal, and Rao and Shao methods, are appealing in 
that only the imputed file (with the imputed fields flagged) 
is required for variance estimation. No auxiliary files 
are needed. Sarndal’s model assisted approach yields 
unbiased variance estimators, provided the model holds 
(Lee, Rancourt and Sarndal 1991). The Rao and Shao 
adjusted jackknife method is design consistent as well as 
model unbiased (Rao 1992). But while the model assisted
approach requires different variance estimators for each
imputation method, the adjusted jackknife method pro- 
vides a unified approach that requires the implementation 
of only one estimator, the jackknife estimator, provided 
the imputed values are adjusted appropriately during the 
variance estimation stage. 

In this paper we describe a simulation study that evalu- 
ates the adjusted jackknife variance estimator of Rao and 
Shao (1992). In Section 2 we motivate the present empirical 
study by demonstrating the characteristics of the naive 
variance estimator under four imputation methods in the 
case of simple random sampling. In Section 3 we briefly 
outline the Rao and Shao adjustment procedure and 
present the empirical results. Extensions to more complex 
designs and experiments with nonrandom nonresponse 
mechanisms are elaborated in Section 4. Finally, in Section 5 
we offer some concluding remarks and recommendations, 
including areas for future study. 


2. BACKGROUND 


Following the notation of Rao (1992), we suppose that
in a sample $s$ of size $n$, $m$ units respond to item $y$, while
$n - m$ units do not. Denote by $y_i^*$ the imputed value for
unit $i$, $i \in s - s_r$, where $s_r$ is the set of responding units. The
usual estimator of the mean $\bar{Y}$ under simple random
sampling, based on the imputed file, is given by

$$\bar{y}_I = \frac{1}{n} \Big( \sum_{i \in s_r} y_i + \sum_{i \in s - s_r} y_i^* \Big). \qquad (1)$$


¹ John G. Kovar, Business Survey Methods Division; Edward J. Chen, Social Survey Methods Division, Statistics Canada, R.H. Coats Building,
Tunney’s Pasture, Ottawa, Ontario, Canada K1A 0T6.

2.1 Imputation Methods


In the present simulation study we consider four simple 
methods of imputation, namely the mean of respondents, 
ratio, nearest neighbour and hot deck imputation methods. 
The reader is referred to Kalton and Kasprzyk (1986) for 
a thorough review of the topic of imputation. The simplest 
and most intuitive method of imputation, when the interest 
lies in estimating the mean of the item y, is to impute all 
missing items with the mean of the observed responding 
units. The imputed value $y_i^*$ for unit $i$, under the mean
imputation method, is thus given by

$$y_i^* = \bar{y}_m, \quad i \in s - s_r. \qquad (2)$$

In this case, the estimator of the population mean $\bar{Y}$ in
(1) reduces to the estimator $\bar{y}_I = \bar{y}_m$. Due to the fact that
this method has the undesirable property of distorting the 
distributions, it is used in practice usually only as a last 
resort. It is included here for illustrative purposes. 

Secondly, we consider a ratio imputation method based 
on the assumption that a correlated auxiliary variable x, 
is available, and that the ratio $y/x$ is the same in the $s_r$
and $s - s_r$ sets, as would be the case if the nonresponse
occurred at random, for example. Under the ratio imputa-
tion method, we impute the predicted value in place of the
missing $y_i$ as follows:

$$y_i^* = \frac{\bar{y}_m}{\bar{x}_m} x_i, \qquad (3)$$

where $\bar{x}_m$ is the mean of the $x$ values of the respondent set
$s_r$. The estimator of the population mean $\bar{Y}$ in (1) reduces
to the double sampling estimator $\bar{y}_I = (\bar{y}_m / \bar{x}_m)\,\bar{x}$, where
$\bar{x}$ is the mean of $x$ over the full sample $s$, by
considering the respondents as the second phase sample.

The third imputation technique we consider is the 
nearest neighbour (NN) method. Under this method, the 
missing value is filled in by an observed value of another 
unit from the set s,, whose distance to the nonresponding 
unit is minimum. In practice the distance functions used 
are usually the $\ell_1$, $\ell_2$ or $\ell_\infty$ Minkowski norms based on
the auxiliary x-variables, assumed observed for all units
in $s$. Thus

$$y_i^* = y_j, \quad j \in s_r, \text{ such that } \| x_i - x_j \| \text{ is minimized}, \qquad (4)$$

where $\| \cdot \|$ is one of the above mentioned norms.


The above three methods are often labelled deter- 
ministic, since, given the sample of respondents, the 
imputed values are determined uniquely. The fourth 
imputation method considered in this study, the hot deck 
method (HD), is non-deterministic, since the imputed 
values are chosen at random from the respondent set. 
While in practice imputation classes are often created and
some sort of sequential procedure is usually implemented,
we consider here the pure hot deck, whereby the donor unit
($j$) is chosen at random, with replacement, from the entire
set $s_r$, that is,

$$y_i^* = y_j, \quad j \in s_r. \qquad (5)$$
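
The four imputation rules (2) to (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `impute`, the use of `None` to mark a missing item, a single auxiliary variable with absolute distance as the norm, and the fixed default seed are all hypothetical choices.

```python
import random
from statistics import mean


def impute(y, x, method, rng=None):
    """Return a completed copy of y, where None marks a missing item.

    x is the auxiliary variable, observed for every unit in the sample.
    method is "mean", "ratio", "nn" (nearest neighbour) or "hd" (hot deck).
    """
    rng = rng or random.Random(0)           # hypothetical default seed
    resp = [i for i, v in enumerate(y) if v is not None]   # respondent set
    y_m = mean(y[i] for i in resp)          # respondent mean of y
    x_m = mean(x[i] for i in resp)          # respondent mean of x
    out = list(y)
    for i, v in enumerate(y):
        if v is not None:
            continue
        if method == "mean":                # rule (2)
            out[i] = y_m
        elif method == "ratio":             # rule (3)
            out[i] = (y_m / x_m) * x[i]
        elif method == "nn":                # rule (4), one x-variable
            donor = min(resp, key=lambda j: abs(x[i] - x[j]))
            out[i] = y[donor]
        elif method == "hd":                # rule (5), with replacement
            out[i] = y[rng.choice(resp)]
        else:
            raise ValueError(f"unknown method: {method}")
    return out
```

For example, `impute([10.0, None, 30.0, 40.0], [1.0, 2.6, 3.0, 4.0], "ratio")` fills the missing item with the respondent ratio times 2.6, while `"nn"` copies the value of the respondent whose x is closest to 2.6.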


2.2 Variance Due to Imputation 


Treating the imputed values as observed values, leads 
to the incorrect variance estimator 


$$v_{naive} = (1 - f) s^2 / n, \qquad (6)$$

where $s^2$ is the sample variance of the complete sample of
responding and imputed values, and $(1 - f)$ is the finite
population correction factor ($f = n/N$). It can be easily
shown that the true variance of the estimator $\bar{y}_I$ in (1),
$V(\bar{y}_I)$, can be written as (Sarndal 1990)

$$V(\bar{y}_I) = V_{sam} + V_{imp} + 2V_{mix}, \qquad (7)$$

where $V_{sam}$ is the sampling variance component, $V_{imp}$ is
the variance introduced by the imputation method in
question and $V_{mix}$ is a covariance term between $V_{sam}$ and
$V_{imp}$ which in most cases is negligible or zero. An estimator
of $V_{sam}$ could be obtained by adding to $v_{naive}$ a term to
correct for the fact that the standard formula understates
the sampling variance component when there are imputed
values in the data set. To estimate $V(\bar{y}_I)$, however, an
additional component of variance due to the imputation
mechanism, $V_{imp}$, must be estimated. This may be done
explicitly, as in Rubin’s (1987) multiple imputation, or by
modifying common variance formulae as in Sarndal (1990)
and Rao and Shao (1992). Note that the interest lies in
estimating the variance of the estimator at hand, that is,
$V(\bar{y}_I)$, not the variance of an estimator that would have
been obtained had there been no nonresponse.


2.3. Variance Underestimation 


To illustrate the seriousness of the underestimation of
$V(\bar{y}_I)$ by $v_{naive}$, and the dependence of the degree of under-
estimation on the imputation method, we first describe the
simulation study used for this purpose. We consider a data 
set of 5,620 units with two variables: An auxiliary variable 
x, the Gross Business Income, available for all units, that 
can be used as a measure of size, and a related purchase 
variable y. The correlation between x and y in this par- 
ticular data set is of the order of 0.92. Simple random 
samples of size 200 were selected without replacement. A 
fixed proportion of units were identified at random as 
nonrespondents, having their y-values deleted and imputed 
according to one of the four methods described above. 
Various rates of nonresponse were generated, though, for 
the most part, we confine our reporting to results based 
on 5 and 30% nonresponse rates. 

To evaluate the performance of the proposed variance
estimators, we calculate the percent relative bias of the
variance estimator $v$, given by

$$\mathrm{Rel.Bias}(v) = \frac{\sum_{k=1}^{K} \left( v_k - V(\bar{y}_I) \right) / K}{V(\bar{y}_I)} \times 100, \qquad (8)$$

where $V(\bar{y}_I)$ is obtained through simulation, and $v_k$ is the
k-th realization of the $K$ simulated variance estimates in
question. Similarly, the percent relative stability of the
variance estimators is given by

$$\mathrm{Rel.Stab.}(v) = \frac{\left[ \sum_{k=1}^{K} \left( v_k - V(\bar{y}_I) \right)^2 / K \right]^{1/2}}{V(\bar{y}_I)} \times 100. \qquad (9)$$

All simulations were performed on an IBM PC, using 
Microsoft’s Fortran 77, Version 5.0. In the case of simple 
random sampling, results are based on averages of 100,000 
replications (K = 100,000). With this number of repli- 
cates, the reported relative bias values were observed not 
to vary by more than one percentage point. The results are 
summarized below in Table 1 for the case of 5 and 30% 
nonresponse rates. 
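
The two evaluation measures can be written as small Python helpers. This is a sketch only: the function names are hypothetical, and relative stability is taken here to be the root mean squared error of the variance estimates expressed as a percentage of the true variance, which is an assumption about the exact form of (9).

```python
from math import sqrt


def rel_bias(estimates, true_var):
    """Percent relative bias of simulated variance estimates, as in (8)."""
    avg = sum(estimates) / len(estimates)
    return 100.0 * (avg - true_var) / true_var


def rel_stab(estimates, true_var):
    """Percent relative stability: root MSE over the true variance
    (assumed form of (9))."""
    mse = sum((v - true_var) ** 2 for v in estimates) / len(estimates)
    return 100.0 * sqrt(mse) / true_var
```

Here `estimates` plays the role of the K simulated values of the variance estimator and `true_var` is the variance of the imputed estimator obtained from the simulation.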


Table 1

Underestimation of Variance of $\bar{y}_I$ by the Naive Estimator
Under Four Imputation Methods, and 5 and 30%
Nonresponse Rates

Nonresponse   Variance                      Imputation Method
Rate          Estimator                  Mean      HD     Ratio     NN

5%            $V(\bar{y}_I)$              9.9     10.3      9.5     9.5
              $v_{naive}$                 8.9      9.4      9.2     9.3
              Rel.Bias($v_{naive}$)     -10.7%    -9.4%    -2.5%   -2.2%

30%           $V(\bar{y}_I)$             13.5     16.5     10.1    10.3
              $v_{naive}$                 6.5      9.4      8.5     9.0
              Rel.Bias($v_{naive}$)     -51.4%   -43.4%   -15.3%  -12.8%


First, we note in Table 1 that the naive estimator under-
estimates the true variance of $\bar{y}_I$ by 10.7% in the case of
mean imputation at a 5% level of nonresponse. About half
of this underestimation is due to the fact that $v_{naive}$ under-
estimates $V_{sam}$, and the other half is due to the fact that
$v_{naive}$ ignores the component $V_{imp}$. Sarndal (1990) obtains
very similar results with respect to the partitioning of the
underestimation in the case of mean imputation. Secondly,
in the first row of Table 1, the true variance of $\bar{y}_I$ is larger
in the case of the hot deck imputation as compared to the
mean imputation, due to the procedure’s inherent vari-
ability (i.e., the $V_{imp}$ component is larger). By contrast,
$V(\bar{y}_I)$ is slightly lower in the case of the ratio and nearest
neighbour imputation methods, since $V_{imp}$ decreases as
the imputation procedure is better able to predict the true
unobserved values (Sarndal 1990), as is the case in the
present study due to the relatively high correlation between
the x and y variables. Thirdly, as can be seen in Table 1,
$V(\bar{y}_I)$ increases while $v_{naive}$ decreases as the nonresponse
rate becomes more elevated. As such, the underestimation
of $V(\bar{y}_I)$, when the imputed values are treated as observed
values, becomes more serious as the proportion of missing
items increases. The problem is more pronounced in the
case of the mean and hot deck imputation methods, which
do not use auxiliary information. Note that underestimation
of variance in the order of 50%, as was observed in this
case, can lead to confidence intervals that are about 30%
too short and to declaration of significance when none
exists. Also of note is the similar behaviour of the ratio and
nearest neighbour methods which will be exploited later.
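
The link between a 50% underestimation of variance and intervals that are about 30% too short follows because interval length is proportional to the square root of the variance estimate. A minimal check, with a hypothetical helper name:

```python
from math import sqrt


def interval_shrinkage(rel_bias_pct):
    """Percent by which a normal-theory confidence interval is shortened
    when the variance estimate has the given percent relative bias."""
    return 100.0 * (1.0 - sqrt(1.0 + rel_bias_pct / 100.0))
```

A relative bias of -50% gives a shrinkage of 100(1 - sqrt(0.5)), or roughly 29%, in line with the figure quoted above.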


3. JACKKNIFE VARIANCE ESTIMATOR 


Let $\bar{y}_I(j)$ be the imputed estimator of $\bar{Y}$ obtained
when the j-th unit is deleted from the sample. Then, in
the case of simple random sampling, a naive jackknife
variance estimator of $\bar{y}_I$ is given by

$$v_J^{naive} = \frac{n - 1}{n} \sum_{j=1}^{n} \left[ \bar{y}_I(j) - \bar{y}_I \right]^2, \qquad (10)$$

which can be shown to reduce to $v_{naive}$ (Rao 1992).


3.1 Imputed Value Adjustment 


In order to produce the ‘‘correct’’ (Rao 1990) jackknife 
variance estimator, Rao (1992) proposed to adjust the 
imputed values as described below. Intuitively, the adjust- 
ment is necessary whenever a responding unit is deleted 
from a jackknife replicate, since in the case of most impu- 
tation methods, all the imputed values depend directly or 
indirectly on the observed value that was deleted. This is 
clear in the case of mean imputation and ratio imputation, 
where all respondents contribute directly to the mean $\bar{y}_m$,
but is less evident in nearest neighbour and hot deck
imputation methods where the deleted unit contributes to
the imputation process only in the sense that it is not
available to be selected as a donor. Thus, whenever a
responding unit is deleted, all imputed values in the sample
must be adjusted before the ‘‘delete-one’’ imputed esti-
mator of the mean is computed. The adjustment must
clearly be a function of the imputation method used. In
the case of the mean and the hot deck imputation methods,
it can be shown that the following adjustment is appro-
priate (Rao 1992; Rao and Shao 1992). Let $z_i^*(j)$ be the
adjusted value of the i-th imputed unit $y_i^*$, when the j-th
unit has been deleted. Then $z_i^*(j)$ is given by

$$z_i^*(j) = \begin{cases} y_i^* + \left[ \bar{y}_m(j) - \bar{y}_m \right] & \text{if } j \in s_r \\ y_i^* & \text{if } j \in s - s_r. \end{cases} \qquad (11)$$


In other words, no adjustment is necessary if the deleted
unit ($j$) has itself been imputed; that is, unit $j$ is a non-
respondent. In the case of the mean imputation, for
example, when $j \in s_r$, the adjusted value reduces to $\bar{y}_m(j)$,
the mean of the remaining $m - 1$ respondents, as desired.
The jackknife variance estimator is evaluated by first
computing the adjusted imputed estimator $\bar{y}_I^a(j)$, as

$$\bar{y}_I^a(j) = \frac{1}{n - 1} \Big( \sum_{i \in s_r,\, i \neq j} y_i + \sum_{i \in s - s_r,\, i \neq j} z_i^*(j) \Big), \qquad (12)$$

and then letting

$$v_J = \frac{n - 1}{n} \sum_{j=1}^{n} \left[ \bar{y}_I^a(j) - \bar{y}_I \right]^2. \qquad (13)$$

It can be shown that the adjusted jackknife variance 
estimator reduces to the correct variance estimator in the 
case of the mean imputation (Rao 1990), and provides a 
consistent estimator in the case of the hot deck imputation 
(Rao and Shao 1992). 
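
The adjustment in (11) and the estimator in (12) and (13) can be traced in a short Python sketch for the mean and hot deck cases. The function name and the split of the completed sample into two lists are assumptions made for illustration; at least two respondents are assumed.

```python
from statistics import mean


def adjusted_jackknife(y_obs, y_imp):
    """Adjusted jackknife variance of the imputed mean under simple random
    sampling, for mean or hot deck imputation.

    y_obs: observed values of the m respondents (m >= 2 assumed).
    y_imp: imputed values of the n - m nonrespondents.
    """
    m, n = len(y_obs), len(y_obs) + len(y_imp)
    y_I = (sum(y_obs) + sum(y_imp)) / n      # imputed estimator (1)
    y_m = mean(y_obs)                        # respondent mean
    total = 0.0
    # Delete each respondent in turn: shift every imputed value by the
    # change in the respondent mean, as in (11).
    for j in range(m):
        rest = y_obs[:j] + y_obs[j + 1:]
        shift = mean(rest) - y_m
        est = (sum(rest) + sum(v + shift for v in y_imp)) / (n - 1)
        total += (est - y_I) ** 2
    # Delete each nonrespondent in turn: no adjustment is needed.
    for j in range(len(y_imp)):
        rest = y_imp[:j] + y_imp[j + 1:]
        est = (sum(y_obs) + sum(rest)) / (n - 1)
        total += (est - y_I) ** 2
    return (n - 1) / n * total
```

With respondents 2, 4, 6, 8 and two nonrespondents imputed by the respondent mean 5, the function returns 50/27 (about 1.85), whereas treating all six values as observed would give s²/n = 4/6 (about 0.67).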


In the case of the ratio imputation, the adjusted values 
are given by 


$$z_i^*(j) = \begin{cases} y_i^* + \left[ \dfrac{\bar{y}_m(j)}{\bar{x}_m(j)} - \dfrac{\bar{y}_m}{\bar{x}_m} \right] x_i & \text{if } j \in s_r \\ y_i^* & \text{if } j \in s - s_r, \end{cases} \qquad (14)$$

where $\bar{x}_m(j)$ is the mean of the $m - 1$ sample values of
$x$ of the responding units when unit $j$ is deleted. The jack-
knife variance estimator $v_J$ is then computed as in (13)
above, yielding the correct variance estimator. Further-
more, Rao (1992) shows that not only is the adjusted jack-
knife variance estimator design consistent (p-consistent)
under uniform nonresponse irrespective of the model, but
is also design-model unbiased (pm-unbiased) under the
model (15) and any nonresponse mechanism that does not
depend on the y-values.


$$E_m(y_i) = \beta x_i, \quad V_m(y_i) = \sigma^2 x_i, \quad \mathrm{Cov}_m(y_i, y_j) = 0, \quad i \neq j \in s. \qquad (15)$$
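
For ratio imputation, the adjustment in (14) amounts to re-imputing each missing item with the ratio computed from the remaining respondents. The sketch below makes that explicit; the function name and argument layout are hypothetical, and at least two respondents with positive x are assumed.

```python
def ratio_adjusted_jackknife(y_obs, x_obs, x_mis):
    """Adjusted jackknife variance of the imputed mean under ratio imputation.

    y_obs, x_obs: y and x for the m respondents (m >= 2 assumed).
    x_mis: x for the n - m nonrespondents, whose y is imputed as r * x.
    """
    m, n = len(y_obs), len(y_obs) + len(x_mis)
    r = sum(y_obs) / sum(x_obs)              # respondent ratio
    y_imp = [r * x for x in x_mis]           # imputed values, as in (3)
    y_I = (sum(y_obs) + sum(y_imp)) / n
    total = 0.0
    for j in range(m):                       # delete respondent j
        r_j = (sum(y_obs) - y_obs[j]) / (sum(x_obs) - x_obs[j])
        z = [v + (r_j - r) * x for v, x in zip(y_imp, x_mis)]   # (14)
        est = (sum(y_obs) - y_obs[j] + sum(z)) / (n - 1)
        total += (est - y_I) ** 2
    for j in range(len(x_mis)):              # delete nonrespondent j
        est = (sum(y_obs) + sum(y_imp) - y_imp[j]) / (n - 1)
        total += (est - y_I) ** 2
    return (n - 1) / n * total
```

When y is exactly proportional to x among respondents, the deleted-unit ratio never changes, no adjustment takes place, and the result coincides with the naive jackknife on the completed data.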


Since the naive variance estimator under the nearest 
neighbour imputation was observed to behave much like 
the naive variance estimator under the ratio imputation, 
the adjustment for the ratio imputation given in (14) was 
used in the case of the nearest neighbour imputation. As 
well, an alternate adjustment was considered, whereby 
unit $i$ was re-imputed using the nearest neighbour method,
whenever the deleted unit ($j$) was used to impute unit $i$.
That is, adjustment takes place only if the deleted unit is
a respondent (as above), but only those nonrespondents
in the j-th jackknife replicate that were actually imputed
using unit $j$ are re-imputed by one of the $m - 1$ remaining
donors. (This corresponds to imputing the second nearest
neighbour for these units.) We note that no theoretical
justification exists for either of these adjustments. Since 
the latter adjustment performed worse than the ratio 
adjustment in our examples, and since its eventual imple- 
mentation in production would be cumbersome, we omitted 
it from further consideration, even though it was always 
observed to be conservative. 

We would like to stress here that for all imputation 
methods the adjustments are only performed for the 
purpose of variance estimation and can be made tempo- 
rarily while the variance estimation program executes. No 
permanent adjustments are required on the imputed file 
used for the estimation of means and totals, though the 
imputed fields must be flagged appropriately. 


3.2 Empirical Results 


The jackknife variance estimator with adjustments cor-
responding to the four imputation methods described
above, was computed in addition to $v_{naive}$ in the simula-
tion study outlined in Section 2. Nonresponse rates of 5
and 30% were considered and the relative biases were
calculated. They are summarized in Table 2 below.


Table 2

Relative Biases of the Naive Variance Estimator and the
Adjusted Jackknife Variance Estimator Under
5 and 30% Nonresponse Rates

Nonresponse   Variance            Imputation Method
Rate          Estimator      Mean      HD     Ratio     NN
                                     in percent
5%            $v_{naive}$   -10.7     -9.4     -2.5    -2.2
              $v_J$           ...      3.6      3.4     3.7
30%           $v_{naive}$   -51.4    -43.4    -15.3   -12.8
              $v_J$           3.3      1.9      3.0     ...


Since the adjusted jackknife variance estimator is 
design consistent (p-consistent) (Rao 1992), it performs 
well in the case of the mean, hot deck and ratio imputa- 
tion under uniform response mechanism, as expected. 
(Equally good performance was observed with other data 
sets which do not follow the model (15) as well, but more 
work is needed on this front.) Of note is the relatively good 
performance under the nearest neighbour imputation. The 
proposed estimator tends to be somewhat conservative, 
due, in small part, to the fact that it does not incorporate 
the finite population correction. 

4. EXTENSIONS


While the adjusted jackknife variance estimator has 
been shown to perform well in the case of simple random 
sampling under uniform nonresponse mechanism in one 
imputation class, we consider here extensions to more 
complex design, to more than one imputation class, and 
to nonrandom response mechanisms. 


4.1 Complex Designs 


In this section we describe a simulation study that 
evaluates the Rao and Shao (1992) adjusted jackknife 
variance estimator in comparison to the naive variance 
estimator, in the case of stratified multistage sampling 
and hot deck imputation. In particular, data from the 
Canadian Survey of Consumer Finances (SCF) that follows 
the design of the Canadian Labour Force Survey will be 
used. The variable of interest, y, is the total household 
income. The SCF follows a complex stratified multistage 
design with the primary sampling units (psu’s) in the strata 
used in this study selected with probability proportional 
to the number of dwellings. Generally speaking, the psu’s 
are collections of dwellings, corresponding to city blocks 
in urban areas and to groups of Census Enumeration 
Areas (EA’s) in rural regions. We used as a population a 
sample of 3,870 households in 30 strata and sampled two 
psu’s in each stratum. As in the case of the simple random 
sampling study, 5 and 30% uniform nonresponse rates 
were generated at the household level. The missing values 
were then imputed using the hot deck imputation method 
described in Rao and Shao (1992). Briefly, the imputation 
method consists of selecting the donors from the respon- 
dent set with replacement, with probability proportional 
to the survey weight of the donors. 
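
The weighted hot deck just described can be sketched with the standard library's `random.choices`, which draws with replacement according to supplied weights. The function name, the `None` convention for missing items and the single imputation class are assumptions for illustration.

```python
import random


def weighted_hot_deck(y, w, rng):
    """Fill missing items (None) in y by drawing donors from the respondents,
    with replacement, with probability proportional to survey weight w."""
    resp = [i for i, v in enumerate(y) if v is not None]
    donor_weights = [w[i] for i in resp]
    out = list(y)
    for i, v in enumerate(y):
        if v is None:
            donor = rng.choices(resp, weights=donor_weights, k=1)[0]
            out[i] = y[donor]
    return out
```

A call such as `weighted_hot_deck(y, w, random.Random(1))` leaves observed values untouched and replaces each missing value by the value of a randomly selected respondent.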

We first consider the case of a single imputation class.
Let $y_{hik}$ be the observed value for the k-th unit in the i-th
psu and the h-th stratum ($k = 1, \ldots, n_{hi}$; $i = 1, \ldots, n_h$;
$h = 1, \ldots, L$), and let $y_{hik}^*$ be the corre-
sponding imputed value whenever the $(hik)$ unit is a non-
respondent, that is, whenever $(hik) \in s - s_r$. The imputed
estimator of $Y$ is then given by

$$\hat{Y}_I = \sum_{(hik) \in s_r} w_{hik} y_{hik} + \sum_{(hik) \in s - s_r} w_{hik} y_{hik}^*, \qquad (16)$$

where $w_{hik}$ is the survey weight corresponding to unit
$(hik)$. Under the above hot deck imputation scheme, $\hat{Y}_I$
is asymptotically unbiased (Rao and Shao 1992).

The expectation of $\hat{Y}_I$ under the hot deck imputation
procedure can be written as (Rao and Shao 1992):

$$E_*(\hat{Y}_I) = \sum_{(hik) \in s_r} w_{hik} y_{hik} + \frac{\sum_{(hik) \in s_r} w_{hik} y_{hik}}{\sum_{(hik) \in s_r} w_{hik}} \sum_{(hik) \in s - s_r} w_{hik} = S + (S/T)\,U, \qquad (17)$$
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thus defining the terms S, T and U. The jackknife ‘‘delete- 
one’’ values are then given by 


3 n 
aie g 
S(gj) = oe Writ Vhik. ay 3 WeikY gik » 
(hik) €s, g (gik) és, 
h#g AJ 
(18) 
NBs n 
T(gj) = YS Whik + ; S i ay Weik » 
(hik) és, & (gik) €s, 
h#g Al 


whenever the $j$-th psu in the $g$-th stratum is deleted. The
adjustment of the imputed values is performed whenever
the $(gj)$-th psu is deleted, $(hi) \ne (gj)$, and $(hik) \in s - s_r$,
by letting

$$z^{(gj)}_{hik} = y^*_{hik} + \left[\frac{S(gj)}{T(gj)} - \frac{S}{T}\right]. \tag{19}$$

Then, analogous to (12) and (13), the jackknife variance
estimator is evaluated by first computing the adjusted
imputed estimator $\hat{Y}^a_I(gj)$ when the $(gj)$-th psu is deleted as

$$\hat{Y}^a_I(gj) = S(gj) + \sum_{\substack{(hik) \in s - s_r \\ h \ne g}} w_{hik}\, z^{(gj)}_{hik} + \frac{n_g}{n_g - 1} \sum_{\substack{(gik) \in s - s_r \\ i \ne j}} w_{gik}\, z^{(gj)}_{gik}, \tag{20}$$

and then setting

$$v_J(\hat{Y}_I) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g} \sum_{j=1}^{n_g} \left[\hat{Y}^a_I(gj) - \hat{Y}_I\right]^2. \tag{21}$$


It can be shown that $v_J$, as defined in (21), is a consistent
estimator of the variance of $\hat{Y}_I$ (Rao and Shao 1992).
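The steps in (17) to (21) can be sketched as follows. This is an illustrative reading of the formulas, not the authors' implementation: the record layout (`h`, `i`, `w`, `y`, `y_star`), the helper names and the toy data are assumptions, and the sketch presumes at least two psu's per stratum and at least one respondent remaining after each deletion.

```python
from collections import defaultdict

def jk_weight(u, g, j, n_g):
    """Jackknife weight of unit u when psu j of stratum g is deleted."""
    if u["h"] != g:
        return u["w"]
    if u["i"] == j:
        return 0.0
    return u["w"] * n_g / (n_g - 1)

def adjusted_jackknife(units):
    """Return the imputed total (16) and the adjusted jackknife variance (21)."""
    resp = [u for u in units if u["y"] is not None]
    miss = [u for u in units if u["y"] is None]
    S = sum(u["w"] * u["y"] for u in resp)
    T = sum(u["w"] for u in resp)
    Y_I = S + sum(u["w"] * u["y_star"] for u in miss)
    psus = defaultdict(set)
    for u in units:
        psus[u["h"]].add(u["i"])
    v = 0.0
    for g, ids in psus.items():
        n_g = len(ids)
        for j in ids:
            S_gj = sum(jk_weight(u, g, j, n_g) * u["y"] for u in resp)
            T_gj = sum(jk_weight(u, g, j, n_g) for u in resp)
            shift = S_gj / T_gj - S / T          # adjustment term in (19)
            Y_gj = S_gj + sum(jk_weight(u, g, j, n_g) * (u["y_star"] + shift)
                              for u in miss)      # adjusted estimator (20)
            v += (n_g - 1) / n_g * (Y_gj - Y_I) ** 2
    return Y_I, v

def unit(h, i, w, y, y_star=None):
    return {"h": h, "i": i, "w": w, "y": y, "y_star": y_star}

# Hypothetical one-stratum, two-psu example.
complete = [unit(0, 0, 1.0, 2.0), unit(0, 0, 1.0, 4.0), unit(0, 1, 1.0, 10.0)]
with_missing = complete + [unit(0, 1, 1.0, None, 4.0)]
Y_full, v_full = adjusted_jackknife(complete)
Y_I, v_J = adjusted_jackknife(with_missing)
```

With no missing values the adjustment vanishes and the function reduces to the usual stratified delete-one-psu jackknife for a weighted total.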

We generated 10,000 samples of 60 psu's selected with
probability proportional to size, and subjected the selected
households to 5 and 30% uniform nonresponse. We then
computed the naive variance estimator, $v_{naive}$, and the adjusted
jackknife variance estimator, $v_J$, in (21). The relative
bias (8) and the relative stability (9) were computed for
both of the variance estimators, and are summarized in
Table 3 below.


Table 3

Relative Bias and Relative Stability (in Parentheses)
of the Naive Variance Estimator and the Adjusted Jackknife
Variance Estimator Under 5 and 30% Nonresponse,
in the Case of Stratified Multistage Sampling

Variance Estimator      Nonresponse Rate
                        5%              30%
                        (in percent)
v_naive                 -10.3 (88)      -43.7 (84)
v_J                     -0.9 (97)       1.2 (124)

As can be seen in Table 3, the naive variance estimator
underestimates the true variance of $\hat{Y}_I$ at rates comparable
to the simple random sampling case (Table 2), with the
underestimation becoming more serious as the nonresponse
rate increases. The adjusted jackknife variance estimator,
on the other hand, performs well at both levels of
nonresponse, at a relatively modest cost of a slight decrease
in the stability of the variance estimator, as compared
to $v_{naive}$.


4.2 Imputation Classes 


Under the same sample design as in Section 4.1, we also 
considered the case of more than one imputation class as 
is the case in practice. The household size, known for all 
households in the sample, was used to form two imputa- 
tion classes, namely one member households and more 
than one member households. This was done under the 
assumption that the propensity to respond is different 
between these two classes, while uniform response pro- 
bability was assumed within the imputation classes. Two 
nonresponse schemes were evaluated. The first assumes a 
5% uniform nonresponse in the single member household 
class and 10% uniform nonresponse in the multiple member 
household class, while the second scheme assumes rates 
of 25 and 30% in each of the classes respectively. The
hot deck imputation, the imputed value adjustments, and
the adjusted total calculations in (20), $\hat{Y}^a_{I\nu}(gj)$, were
performed independently within each imputation class,
denoted by $\nu$. The terms $\hat{Y}^a_{I\nu}(gj)$ were then summed over
the two imputation classes, yielding $\hat{Y}^a_I(gj)$, which was
used in (21) to provide the estimate $v_J$. The results are
summarized in Table 4.


Table 4

Relative Bias and Relative Stability (in Parentheses)
of the Naive Variance Estimator and the Adjusted Jackknife
Variance Estimator Under Two Nonresponse Schemes,
in the Case of Stratified Multistage Sampling
and Two Imputation Classes

Variance Estimator      Nonresponse Rate
                        5% and 10%      25% and 30%
                        (in percent)
v_naive                 -16.7 (87)      -40.2 (84)
v_J                     -1.0 (103)      1.1 (127)

As can be seen in Table 4, the adjusted jackknife variance
estimator $v_J$ performs well under both nonresponse
schemes. The results, along with those in Table 3,
demonstrate the consistency and the reasonably good stability of
the adjusted jackknife variance estimator, even in cases
of elevated nonresponse rates.


4.3 Nonrandom Nonresponse 


As demonstrated above, the adjusted jackknife variance
estimator performs well when the nonresponse is random
within imputation classes. To study its robustness against
departures from the uniform response mechanism assumption,
we used the data set described in Section 2, and generated
nonresponse as outlined in Lee, Rancourt and Särndal (1991). In
particular, the probability of nonresponse is assumed to
be related to the x-variable in two distinct ways:


$$P_L = 1 - \exp(-c_L x), \tag{22}$$

$$P_S = \exp(-c_S x), \tag{23}$$


where the constants $c_L$ and $c_S$ are chosen such that an
expected 30% nonresponse rate is achieved. In the model
$P_L$ given in (22) the nonresponse is positively correlated
with the x-variable, implying that large (L) units are more
likely not to respond. The opposite is true in the model $P_S$
given in (23), under which smaller (S) units are more likely
not to respond. Imputation methods which ignore the
x-variable (mean and hot deck) are expected to yield
estimators of $\bar{Y}$ that underestimate the true mean under
nonresponse model (22) and overestimate the true mean under
the model (23). However, imputation methods that
incorporate the auxiliary variable into the procedure (ratio and
nearest neighbour) can be expected to produce better
estimates of the mean. This has been confirmed by
simulation as shown in Table 5 below. As before, 100,000
replicates were used.


Table 5

Estimates of the Mean $\bar{Y}$ as Percent of the True Mean
when the Nonresponse is not Random, and the Nonresponse
Rate is an Expected 30%

Nonresponse     Imputation Method
Model           Mean          HD            Ratio     NN
                (in percent)
P_L             60.4          60.4          94.7      93.5
P_S             [illegible]   [illegible]   102.0     101.4


Clearly, variance estimation is of no interest when the 
point estimators themselves are highly biased as is the case 
for the mean and hot deck methods. However, in the case 
of the ratio and nearest neighbour methods, under which 
the point estimators perform better, we investigated the 
performance of the adjusted jackknife variance estimator, 
as well as an estimator proposed by Särndal (1990), which
can be written as (Rao 1992):
Vs(¥7) = 


+ 
aN 


Vm 2m Vn 
te aeee ¢ - \ (24) 


Ym ae Oe CAS 9) 
pon) ey eae pi f 


a 


provided that the finite population correction factor is
ignored, and that $(n - 1)/n \approx 1$ and $(m - 1)/m \approx 1$.
The results are summarized in Table 6.


Table 6

Relative Bias of the Naive Variance Estimator, the Adjusted
Jackknife Variance Estimator and Särndal's Variance
Estimator Under 30% Nonrandom Nonresponse

Nonresponse   Variance      Imputation Method
Model         Estimator     Ratio          NN
                            (in percent)
P_L           v_naive       -22.7          [illegible]
              v_J           3.9            [illegible]
              v_S           [illegible]    26.8
P_S           v_naive       -4.0           -0.7
              v_J           [illegible]    [illegible]
              v_S           2.8            4.5


In the case of the ratio imputation, the naive variance
estimator performs quite differently under the two
nonresponse models (-22.7% versus -4.0%). This is due to the
fact that while the reduction in effective sample size tends
to decrease the variance in both cases, under the $P_L$ model
disproportionately more large units are missing which
tends to accentuate this effect, whereas under the $P_S$
model, where disproportionately more small units are
missing, this effect tends to be partly compensated for.
Secondly, the adjusted jackknife variance estimator performs
well in the case of ratio imputation, but relatively poorly
in the case of nearest neighbour imputation. This is due
to the fact that the present data set follows the usual linear
model (15) fairly well and the adjusted jackknife variance
estimator has been shown to be model unbiased (Rao 1992)
in the case of the ratio imputation. On the other hand, the
ratio adjustment does not work well in the case of nearest
neighbour imputation when the nonresponse is not
uniform. The alternate adjustment for the nearest neighbour
imputation described in Section 3 performs equally poorly
in absolute terms (not shown here), though the estimates
are always conservative. Thirdly, the performance of
Särndal's estimator, $v_S$, is roughly equivalent to that of
the adjusted jackknife estimator under either the ratio
or the nearest neighbour imputation methods, and
nonrandom nonresponse that depends only on x.

In cases where the response mechanism is not random, 
and when the propensity to respond is related to the 
variable subject to nonresponse (y), the point estimators 
are themselves severely biased under all four imputation 
methods. As such, variance estimation is of little interest, 
as the real interest lies in estimating the mean squared 
error. That is, more attention needs to be concentrated on 
improving the point estimates and their bias. Some prelim- 
inary results on this front have been put forth by Rancourt, 
Lee and Sarndal (1992). 


5. CONCLUDING REMARKS 


It is well known that the usual variance estimator under- 
states the variance of the estimate of Y in the presence of 
imputed values if these values are treated as having been 
observed. In this study we again demonstrated the high 
degree of underestimation of the naive variance estimator 
in the presence of imputed data. Several imputation 
methods were considered in order to illuminate the depen- 
dence of the degree of underestimation on the method of 
imputation. We evaluated a unified jackknife variance 
estimator proposed by Rao and Shao (1992), an estimator 
that incorporates the variance due to imputation compo- 
nent. The study demonstrated some desirable properties 
of the proposed estimator in the case of both simple 
random sampling as well as complex survey designs. Our 
findings can be summarized as follows. 


(1) The extent of variance underestimation is highly 
dependent on both the imputation method’s ability to 
predict the true values, and its ability to preserve the 
natural variation in the data. 


(2) The proposed adjusted jackknife variance estimator 
offers a unified approach to variance estimation of 
imputed data, that is easy to implement under a 
number of imputation methods and under designs of 
varying complexity. 


(3) Operationally, no modifications to the original imputed
file are necessary and the estimation of means and totals
is thus unaffected by the need to estimate variances.


(4) The proposed method is easily extended to more 
complex designs, more than one imputation class and, 
with care, to the case of nonrandom nonresponse that 
depends only on available auxiliary variables. 


(5) The adjusted jackknife variance estimator performs
well whenever the nonresponse is uniform or the usual
linear model holds, demonstrating the fact that the
estimator is both design consistent and design-model
unbiased.


(6) In the case of the $P_L$ model, under which units with
large y-values are more likely to not respond, all three
variance estimators perform extremely poorly.


(7) In the case of y-dependent nonresponse, better impu- 
tation techniques are needed and the bias of the point 
estimators needs to be studied further. Here the issue 
is primarily that of estimating the mean square error 
rather than the variance. 


Given the relatively high degree of imputation in today’s 
surveys, at least within some imputation classes, it is clear 
that the effect of imputation on variance estimation 
cannot be ignored. An overestimation of precision can 
lead to confidence intervals that are too short and to 
spurious declaration of significance. If implementation of 
the above suggested methods is deemed too onerous in any 
particular circumstance, at the very least studies should 
be conducted to evaluate the impact of imputation in some 
representative cases. An ad hoc variance inflation factor 
could then be implemented. With the emergence of gener- 
alized estimation software, however, there seems to remain 
little reason for not implementing variance estimators 
which correctly account for the effect of imputation. 


There clearly remain many unsolved, and perhaps 
unsolvable problems. To begin with, much more theo- 
retical work is needed with respect to nearest neighbour 
imputation. The jackknife adjustments considered for this 
imputation method fail to perform as well as those applied 
to the other methods. Perhaps smoother alternatives to the 
nearest neighbour method need to be developed. Secondly, 
the robustness of the proposed estimator must be inves- 
tigated. It is clear that satisfactory performance can be 
obtained if the model (15) holds, and when nonresponse 
is random. Limited failure of either one of these condi- 
tions did not seem to detract from the good performance 
of the jackknife estimator in our limited experience, but 
further research along these lines is warranted. Departures 
from both of the conditions simultaneously are yet to be 
investigated. Cases of nonrandom nonresponse when the 
propensity of nonresponse is related to the y-variable are 
even less well understood, though the emphasis in this case 
must be placed on the estimation of the mean square error 
rather than the variance. Thirdly, comparisons to multiple 
imputation results should be considered. It must be recog- 
nized, however, that proper imputation methods (Rubin 
1987) must first be established. We note that none of the 
imputation methods studied within are proper with respect 
to multiple imputation. 


Extensions to other imputation methods and other 
parameters of interest should be undertaken. This study 
was limited to four simple imputation methods. In practice, 
much more complicated methods are used, often in con- 
junction with each other. The impact of more than one 
imputation method on the estimation of variance has 


been studied by Rancourt, Lee and Sarndal (1993); more 
work is needed. With respect to other, more complicated 
methods of imputation, the effect of adding theoretical 
residuals to imputed data can, for example, be considered. 
However, this technique only addresses the underestimation
of $V_{sam}$ by $v_{naive}$ and ignores the effect of $V_{imp}$. Finally,
other parameters, such as the median for example, and the 
effect of imputation on their variance are yet to be eval- 
uated. Multivariate extensions can likewise be considered: 
estimation of correlations, ratios and regression parameters
in the presence of imputation would likely be of interest.

Estimation in Overlapping Clusters with Unknown
Population Size


D.S. TRACY and S.S. OSAHAN¹


ABSTRACT 


Two sampling strategies for estimation of population mean in overlapping clusters with known population size 
have been proposed by Singh (1988). In this paper, ratio estimators under these two strategies are studied assuming 
the actual population size to be unknown, which is the more realistic situation in sample surveys. The sampling 
efficiencies of the two strategies are compared and a numerical illustration is provided. 


KEY WORDS: Overlapping clusters; Clustering before sampling; Mean square error; Relative efficiency. 


1. INTRODUCTION 


In cluster sampling, clusters are formed either before 
selecting the sample (CBS) or after selecting the sample 
(CAS). In both cases, clusters may be overlapping or non- 
overlapping. For non-overlapping clusters, much work by 
several researchers is available in the literature. However, 
there are many practical sampling situations where one 
gets overlapping clusters. For example, overlapping
clusters may exist in a regional epidemiological survey
of a contagious disease like tuberculosis (T.B.), caused by
Mycobacterium tuberculosis, which is becoming very
prevalent with the spread of AIDS
(Gifford-Jones 1993). Clusters here may be formed around
infected individuals or closely associated individuals who 
are more vulnerable to the same type of infection. A 
similar situation may exist in an ecological survey where 
clusters are formed around the factories burning coal and 
emitting polyaromatic hydrocarbons (PAH’s) which are 
potent cancer causing compounds. Clusters are formed on 
the basis of the intensity of such gases, and surveys may 
be required in order to control air pollution which causes 
lung diseases like bronchitis. For overlapping clusters, one 
can refer to the limited work done by Goel and Singh 
(1977), Agarwal and Singh (1982) and Amdekar (1985). 
But the methodologies developed by them suffer from one 
limitation or the other. 

Recently, Singh (1988) has developed a very simple 
estimator for a population mean using two sampling 
strategies in the CBS system assuming known population 
size. In the first strategy, clusters are selected with equal 
probabilities, whereas in the second case selection proba- 
bilities are taken proportional to cluster size. The elements 
within the clusters are selected with equal probability in 
both the cases. But it is unrealistic to assume that the actual
population size is known. If it were known, then all the
duplicates in the population would be known a priori, and one
could easily remove them to increase the efficiency of the
sampling design. Hence, the estimators of the population
mean studied by Singh (1988) need an improvement in 
order to be practicable, as they depend on the actual 
population size. This limitation in the methodology has 
motivated the present work. 

We propose two sampling strategies in the CBS system 
with simple ratio estimators for the population mean, 
which do not depend on the actual population size. As in 
Singh (1988), an equal probability with replacement 
sampling scheme is used for selecting the clusters in the 
first strategy, whereas in the second, an unequal probabil- 
ity sampling scheme is used. The elements within the 
clusters are selected with an equal probability without 
replacement sampling scheme in both strategies. 

The population of $N$ units under consideration is expressible
in the form of $K$ overlapping clusters with $N_i$ units
in the $i$-th cluster and $\sum_{i=1}^{K} N_i = M \ge N$, where $N$ is the
unknown actual population size (equality holds only for
non-overlapping clusters). A population unit may be included
in more than one cluster. Let $y$ be the characteristic of
interest and let the population mean be $\bar{Y}$.


Define

$$z_{ij} = y_{ij}/F_{ij}, \quad w_{ij} = 1/F_{ij}; \quad i = 1, \ldots, K; \; j = 1, \ldots, N_i,$$

where $y_{ij}$ is the value of $y$ for the $j$-th unit in the $i$-th
cluster and $F_{ij}$ its frequency of occurring in the $K$ clusters.
Since a unit occurring in $F_{ij}$ clusters contributes $1/F_{ij}$ from
each of them, $\sum_i \sum_j z_{ij} = Y$ and $\sum_i \sum_j w_{ij} = N$.


When clusterwise data on units are available on the
computer, the values of these frequencies for overlapping
clusters may be easily available. As for the example
considered earlier in epidemiology, suppose we have data
available for households or individuals along with their
identification labels like house numbers or social insurance
numbers/health card numbers on the computer. Then, by
giving a simple command to the computer, a researcher
can easily extract information about the repetition of a
certain unit from its label in different clusters. Also, in case
we have a map of the overlapping clusters and the criterion
for forming clusters does not allow the elimination of
duplicate units in the different clusters, the values of
such frequencies may be known.
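The "simple command" for extracting the frequencies can be sketched as a count of labels across clusters. The cluster names, unit labels and y-values below are hypothetical and chosen only to make the example self-contained; the check at the end uses the fact that each unit contributes $1/F$ from each of the $F$ clusters containing it.

```python
from collections import Counter

# Hypothetical overlapping clusters, each listed by unit labels.
clusters = {
    "A": ["u1", "u2", "u3"],
    "B": ["u2", "u4", "u5", "u6"],
    "C": ["u3", "u6", "u7", "u8", "u9"],
}
# Hypothetical y-values, one per distinct unit.
y = {"u1": 1.0, "u2": 3.0, "u3": 4.0, "u4": 7.0, "u5": 2.0,
     "u6": 6.0, "u7": 8.0, "u8": 9.0, "u9": 5.0}

# F: number of clusters in which each label occurs.
freq = Counter(label for members in clusters.values() for label in members)

# z = y / F and w = 1 / F, cluster by cluster.
z = {c: [y[u] / freq[u] for u in members] for c, members in clusters.items()}
w = {c: [1.0 / freq[u] for u in members] for c, members in clusters.items()}

M = sum(len(members) for members in clusters.values())  # total of cluster sizes
N_check = sum(sum(ws) for ws in w.values())             # recovers the number of distinct units
Y_check = sum(sum(zs) for zs in z.values())             # recovers the population total of y
```

Here `M` counts units with repetition, while summing the `w` values recovers the number of distinct units without ever removing the duplicates.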

The two strategies are discussed in section 2 and their 
efficiencies are compared in section 3. 


2. THE TWO STRATEGIES 


The two proposed strategies are discussed in Sections 
2.1 and 2.2. Their comparison is undertaken in Section 3. 


2.1 Strategy A 


This strategy consists of the following steps: 


(a) Select $k$ clusters out of $K$ by simple random sampling
with replacement (SRSWR).


(b) From the $i$-th selected cluster of size $N_i$ ($i = 1, \ldots, K$),
select $n_i$ elementary units by simple random sampling
without replacement (SRSWOR).


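Theorem 1. The ratio estimator under SRS

$$\bar{z}_{RS} = \hat{Y}_{RS}/\hat{N}_{RS} = \sum_{i=1}^{k} N_i \bar{z}_i \Big/ \sum_{i=1}^{k} N_i \bar{w}_i \tag{1}$$

has relative bias, to the first order of approximation,

$$RB(\bar{z}_{RS}) = \frac{K^2}{k}\left[\left(\frac{\sigma^2_{bw}}{N^2} - \frac{\sigma_{bzw}}{YN}\right) + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(\frac{S^2_{iw}}{N^2} - \frac{S_{izw}}{YN}\right)\right], \tag{2}$$

where

$$\sigma_{bzw} = \sum_{i=1}^{K} (N_i \bar{Z}_i - Y/K)(N_i \bar{W}_i - N/K)/K,$$

$$S_{izw} = \sum_{j=1}^{N_i} (z_{ij} - \bar{Z}_i)(w_{ij} - \bar{W}_i)/(N_i - 1),$$

$$\bar{Z}_i = \sum_{j=1}^{N_i} z_{ij}/N_i \quad \text{and} \quad \bar{z}_i = \sum_{j=1}^{n_i} z_{ij}/n_i,$$

and $\sigma^2_{bw}$, $S^2_{iw}$, $\bar{W}_i$ and $\bar{w}_i$ are the expressions of $\sigma_{bzw}$, $S_{izw}$,
$\bar{Z}_i$ and $\bar{z}_i$ respectively, with $z$ replaced by $w$ and $Y$ replaced
by $N$.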


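Proof. Following a standard result, the relative bias of the
estimator $\bar{z}_{RS}$, to the first order of approximation, is

$$RB(\bar{z}_{RS}) = [V(\hat{N}_{RS})/N^2] - Cov(\hat{Y}_{RS}, \hat{N}_{RS})/(YN). \tag{3}$$

Let $E_2$ and $V_2$ denote the conditional expectation and
variance for a given sample of clusters and $E_1$ and $V_1$ the
expectation and variance over all such samples. Then, we
have

$$V(\hat{N}_{RS}) = V_1 E_2(\hat{N}_{RS}) + E_1 V_2(\hat{N}_{RS}) = \frac{K^2}{k}\left[\sigma^2_{bw} + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iw}\right]. \tag{4}$$

Similarly, we have

$$Cov(\hat{Y}_{RS}, \hat{N}_{RS}) = \frac{K^2}{k}\left[\sigma_{bzw} + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_{izw}\right]. \tag{5}$$

By substituting (4) and (5) in (3), we obtain (2), which
completes the proof of the theorem.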


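Theorem 2. The mean square error (MSE) of the estimator
$\bar{z}_{RS}$, to the first order of approximation, is

$$MSE(\bar{z}_{RS}) = \frac{K}{kN^2}\sum_{i=1}^{K} N_i^2\left[(\bar{Z}_i - \bar{Y}\bar{W}_i)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right) D_i^2\right], \tag{6}$$

where $D_i^2 = S^2_{iz} - 2\bar{Y} S_{izw} + \bar{Y}^2 S^2_{iw}$, and
$S^2_{iz} = \sum_{j=1}^{N_i} (z_{ij} - \bar{Z}_i)^2/(N_i - 1)$.

Proof. To the first order of approximation, we have

$$MSE(\bar{z}_{RS}) = [V(\hat{Y}_{RS}) - 2\bar{Y}\, Cov(\hat{Y}_{RS}, \hat{N}_{RS}) + \bar{Y}^2 V(\hat{N}_{RS})]/N^2. \tag{7}$$

The expression for $V(\hat{Y}_{RS})$ may be written, following
(4), as

$$V(\hat{Y}_{RS}) = \frac{K^2}{k}\left[\sigma^2_{bz} + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iz}\right], \tag{8}$$

where $\sigma^2_{bz} = \sum_{i=1}^{K} (N_i \bar{Z}_i - Y/K)^2/K$.


By substituting (4), (5) and (8) in (7), we obtain upon
simplification

$$MSE(\bar{z}_{RS}) = \frac{K^2}{kN^2}\left(\sigma^2_{bz} - 2\bar{Y}\sigma_{bzw} + \bar{Y}^2\sigma^2_{bw}\right) + \frac{K}{kN^2}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(S^2_{iz} - 2\bar{Y} S_{izw} + \bar{Y}^2 S^2_{iw}\right). \tag{9}$$

Substitution of the expressions for $\sigma^2_{bz}$, $\sigma_{bzw}$ and $\sigma^2_{bw}$
into (9) and simplification yields (6). Now, we provide an
estimator of $MSE(\bar{z}_{RS})$ below.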


Theorem 3. A consistent estimator of $MSE(\bar{z}_{RS})$, to the
first order of approximation, is given by

$$\widehat{MSE}(\bar{z}_{RS}) = \frac{K^2}{k(k-1)\hat{N}^2_{RS}} \sum_{i=1}^{k} N_i^2 (\bar{z}_i - \bar{z}_{RS}\bar{w}_i)^2. \tag{10}$$

Proof. We note that the first-stage sampling is done with
the SRSWR sampling scheme and the random variables $N_i\bar{z}_i$
and $N_i\bar{w}_i$ in the ratio estimator are independently and
identically distributed. Hence, the mean square error of
$\bar{z}_{RS}$ can be estimated using the well-known result that a
variance estimator for a multi-stage design can consider
the first stage only (see Särndal, Swensson and Wretman
1992, Results 2.9.1 and 4.5.1).


From (9), an unbiased estimator of

$$\sigma^2_{bz} + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iz}$$

can be written as

$$s^2_{bz} = \frac{1}{k-1}\sum_{i=1}^{k}\left(N_i\bar{z}_i - \frac{1}{k}\sum_{i=1}^{k} N_i\bar{z}_i\right)^2, \tag{11}$$

and an unbiased estimator of

$$\sigma_{bzw} + \frac{1}{K}\sum_{i=1}^{K} N_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_{izw}$$

is $s_{bzw}$, defined parallel to (11).
Using these results, one can easily show that a consistent
estimator of $MSE(\bar{z}_{RS})$ given in (6) is provided by

$$\widehat{MSE}(\bar{z}_{RS}) = \frac{K^2}{k\hat{N}^2_{RS}}\left(s^2_{bz} - 2\bar{z}_{RS}\, s_{bzw} + \bar{z}^2_{RS}\, s^2_{bw}\right), \tag{12}$$

which can be written as (10).
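The equivalence of the two forms of the MSE estimator can be checked numerically. The sketch below is illustrative only: it assumes $\hat{N}_{RS} = (K/k)\sum N_i\bar{w}_i$, the function names are hypothetical, and the sampled clusters are made-up data.

```python
def ratio_and_mse_srs(sample, K):
    """sample: list of (N_i, z-values, w-values) for the k selected clusters.
    Returns the ratio estimate and the MSE estimator in its compact form (10)."""
    k = len(sample)
    a = [N_i * sum(zs) / len(zs) for N_i, zs, ws in sample]  # N_i * zbar_i
    b = [N_i * sum(ws) / len(ws) for N_i, zs, ws in sample]  # N_i * wbar_i
    z_rs = sum(a) / sum(b)
    N_hat = K / k * sum(b)
    mse = K ** 2 / (k * (k - 1) * N_hat ** 2) * sum(
        (ai - z_rs * bi) ** 2 for ai, bi in zip(a, b))
    return z_rs, N_hat, mse, a, b

def mse_expanded(a, b, z_rs, N_hat, K):
    """The same estimator written with s2_bz, s_bzw and s2_bw, as in (11)."""
    k = len(a)
    a_bar, b_bar = sum(a) / k, sum(b) / k
    s2_bz = sum((ai - a_bar) ** 2 for ai in a) / (k - 1)
    s2_bw = sum((bi - b_bar) ** 2 for bi in b) / (k - 1)
    s_bzw = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (k - 1)
    return K ** 2 / (k * N_hat ** 2) * (s2_bz - 2 * z_rs * s_bzw + z_rs ** 2 * s2_bw)

# Hypothetical sample: k = 2 clusters out of K = 3, with n_i = 2 units each.
sample = [(4, [1.0, 3.5], [1.0, 0.5]), (5, [2.0, 3.0], [1.0, 0.5])]
z_rs, N_hat, mse_compact, a, b = ratio_and_mse_srs(sample, K=3)
mse_long = mse_expanded(a, b, z_rs, N_hat, K=3)
```

The two values agree because the mean of $N_i\bar{z}_i - \bar{z}_{RS}N_i\bar{w}_i$ over the sampled clusters is zero by construction of the ratio.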


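2.2 Strategy B

This strategy consists of the following steps:


(a) Select $k$ clusters out of $K$ by probability proportional
to size with replacement (PPSWR) sampling with
selection probabilities $P_i = N_i/M$, $i = 1, \ldots, K$.


(b) Same as for strategy A.

Theorem 4. The ratio estimator under PPS sampling

$$\bar{z}_{RP} = \hat{Y}_{RP}/\hat{N}_{RP} = \frac{M}{k}\sum_{i=1}^{k}\bar{z}_i \Big/ \frac{M}{k}\sum_{i=1}^{k}\bar{w}_i \tag{13}$$

has relative bias, to the first order of approximation,

$$RB(\bar{z}_{RP}) = \frac{M^2}{k}\left[\left(\frac{\sigma'^{2}_{bw}}{N^2} - \frac{\sigma'_{bzw}}{YN}\right) + \sum_{i=1}^{K}\frac{N_i}{M}\left(\frac{1}{n_i} - \frac{1}{N_i}\right)\left(\frac{S^2_{iw}}{N^2} - \frac{S_{izw}}{YN}\right)\right], \tag{14}$$

where

$$\sigma'_{bzw} = \sum_{i=1}^{K} (\bar{Z}_i - Y/M)(\bar{W}_i - N/M)(N_i/M)$$

and $\sigma'^{2}_{bw}$ is the expression of $\sigma'_{bzw}$ with $z$ replaced by $w$
and $Y$ replaced by $N$.

Proof. Using a standard result, the relative
bias, to the first order of approximation, is

$$RB(\bar{z}_{RP}) = [V(\hat{N}_{RP})/N^2] - Cov(\hat{Y}_{RP}, \hat{N}_{RP})/(YN). \tag{15}$$

We have

$$V(\hat{N}_{RP}) = V_1 E_2(\hat{N}_{RP}) + E_1 V_2(\hat{N}_{RP}) = \frac{M^2}{k}\left[\sigma'^{2}_{bw} + \sum_{i=1}^{K}\frac{N_i}{M}\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iw}\right]. \tag{16}$$

Similarly, one can write

$$Cov(\hat{Y}_{RP}, \hat{N}_{RP}) = \frac{M^2}{k}\left[\sigma'_{bzw} + \sum_{i=1}^{K}\frac{N_i}{M}\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S_{izw}\right]. \tag{17}$$

Substituting (16) and (17) in (15), we obtain (14).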


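Theorem 5. The MSE of the estimator $\bar{z}_{RP}$, to the first
order of approximation, is

$$MSE(\bar{z}_{RP}) = \frac{M}{kN^2}\sum_{i=1}^{K} N_i\left[(\bar{Z}_i - \bar{Y}\bar{W}_i)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right) D_i^2\right]. \tag{18}$$

Proof. We write, to the first order of approximation,

$$MSE(\bar{z}_{RP}) = [V(\hat{Y}_{RP}) - 2\bar{Y}\, Cov(\hat{Y}_{RP}, \hat{N}_{RP}) + \bar{Y}^2 V(\hat{N}_{RP})]/N^2. \tag{19}$$

Also, from Theorem 2.5 of Singh (1988), we have by
analogy

$$V(\hat{Y}_{RP}) = \frac{M^2}{k}\left[\sigma'^{2}_{bz} + \sum_{i=1}^{K}\frac{N_i}{M}\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iz}\right], \tag{20}$$

where $\sigma'^{2}_{bz} = \sum_{i=1}^{K} (N_i/M)(\bar{Z}_i - Y/M)^2$. On substituting
(16), (17) and (20) in (19) and simplifying, we obtain (18).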


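Theorem 6. A consistent estimator of $MSE(\bar{z}_{RP})$, to the
first order of approximation, is

$$\widehat{MSE}(\bar{z}_{RP}) = \frac{M^2}{\hat{N}^2_{RP}} \cdot \frac{1}{k(k-1)} \sum_{i=1}^{k} (\bar{z}_i - \bar{z}_{RP}\bar{w}_i)^2. \tag{21}$$

Proof. As the first-stage units are selected with PPSWR,
the justification given in the proof of Theorem 3 applies
here as well.


From (20), using Results 2.9.1 and 4.5.1 of Särndal,
Swensson and Wretman (1992), an unbiased estimator of

$$\sigma'^{2}_{bz} + \sum_{i=1}^{K}\frac{N_i}{M}\left(\frac{1}{n_i} - \frac{1}{N_i}\right) S^2_{iz}$$

can be written as

$$s'^{2}_{bz} = \frac{1}{k-1}\sum_{i=1}^{k}\left(\bar{z}_i - \frac{1}{k}\sum_{i=1}^{k}\bar{z}_i\right)^2. \tag{22}$$

Similarly, defining $s'_{bzw}$ and $s'^{2}_{bw}$, one can show that

$$\widehat{MSE}(\bar{z}_{RP}) = \frac{M^2}{k\hat{N}^2_{RP}}\left(s'^{2}_{bz} - 2\bar{z}_{RP}\, s'_{bzw} + \bar{z}^2_{RP}\, s'^{2}_{bw}\right),$$

which can be written as (21).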


3. EFFICIENCY COMPARISON 


The efficiencies of the estimators are compared below 
under the two strategies. 


Remark. The estimator $\bar{z}_{RP}$ under strategy B is expected to
be more efficient than the estimator $\bar{z}_{RS}$ under strategy A.


We provide a justification. From (6) and (18), we obtain

$$MSE(\bar{z}_{RS}) - MSE(\bar{z}_{RP}) = \frac{M}{kN^2}\sum_{i=1}^{K} N_i\left[(\bar{Z}_i - \bar{Y}\bar{W}_i)^2 + \left(\frac{1}{n_i} - \frac{1}{N_i}\right) D_i^2\right]\left(\frac{KN_i}{M} - 1\right).$$

As the cluster size $N_i$ increases, the factor $(KN_i/M - 1)$
will also increase. The other factor of the term under
summation is $N_i[(\bar{Z}_i - \bar{Y}\bar{W}_i)^2 + (1/n_i - 1/N_i)D_i^2]$,
which represents the contribution due to variability in $z$
and $w$ present in the $i$-th cluster (without the constant
$M/kN^2$) towards $MSE(\bar{z}_{RP})$ in (18). As cluster size $N_i$
increases, the contribution of the $i$-th cluster towards
$MSE(\bar{z}_{RP})$ is also expected to increase. Since
$\sum_{i=1}^{K}(KN_i/M - 1) = 0$, the sum above is proportional to the
covariance, over clusters, between these two factors, and this
covariance is then positive. Hence, the
estimator $\bar{z}_{RP}$ is expected to have a smaller MSE than $\bar{z}_{RS}$.
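The comparison can be explored numerically. The following sketch evaluates the first-order MSE expressions of the two strategies for a made-up population of three overlapping clusters and checks the difference formula; the forms of (6) and (18) used here, in which the strategy A terms carry $KN_i^2$ and the strategy B terms carry $MN_i$, are assumptions consistent with the derivation above, and all data and names are hypothetical.

```python
def cluster_terms(zs, ws, n_i, Ybar):
    """Return N_i and the bracketed factor (Zbar_i - Ybar*Wbar_i)^2 + (1/n_i - 1/N_i) D_i^2."""
    N_i = len(zs)
    Zb, Wb = sum(zs) / N_i, sum(ws) / N_i
    S2z = sum((z - Zb) ** 2 for z in zs) / (N_i - 1)
    S2w = sum((w - Wb) ** 2 for w in ws) / (N_i - 1)
    Szw = sum((z - Zb) * (w - Wb) for z, w in zip(zs, ws)) / (N_i - 1)
    D2 = S2z - 2 * Ybar * Szw + Ybar ** 2 * S2w
    return N_i, (Zb - Ybar * Wb) ** 2 + (1 / n_i - 1 / N_i) * D2

# Hypothetical population: (z-values, w-values) per cluster, with z = y/F, w = 1/F.
pop = [
    ([1.0, 1.5, 2.0], [1.0, 0.5, 0.5]),
    ([1.5, 7.0, 2.0, 3.0], [0.5, 1.0, 1.0, 0.5]),
    ([2.0, 3.0, 8.0, 9.0, 5.0], [0.5, 0.5, 1.0, 1.0, 1.0]),
]
n = [2, 2, 2]   # within-cluster sample sizes
k = 2           # number of clusters drawn
K = len(pop)
Y = sum(sum(zs) for zs, _ in pop)
N = sum(sum(ws) for _, ws in pop)
M = sum(len(zs) for zs, _ in pop)
terms = [cluster_terms(zs, ws, n_i, Y / N) for (zs, ws), n_i in zip(pop, n)]

mse_rs = K / (k * N ** 2) * sum(N_i ** 2 * f for N_i, f in terms)   # strategy A
mse_rp = M / (k * N ** 2) * sum(N_i * f for N_i, f in terms)        # strategy B
diff = M / (k * N ** 2) * sum(N_i * f * (K * N_i / M - 1) for N_i, f in terms)
```

Whether `mse_rp` is smaller than `mse_rs` depends on the population; the identity `mse_rs - mse_rp == diff` holds for any population, up to rounding.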
Table 1


Comparison of the Two Strategies for Two
Small Populations


Population No. 1 | Population No. 2

4 6 
nj 2 2 2 
Y;; 1,3,4,7 | 2,3,6,8,9 4,4,5,6 | 2,3,3,4,5,6 
Fi EC tae Bee Bape ple eles by 
Zij 04,7) 1 251.3:859 4,4,2.5,3 | 2,3,3,2,5,3 
Wij %1,% ]1,%,1,1 | 1,%,%,1,1 1,1,4%,% | 1,1,1,%,1,% 
F 1.38 | 10.16 | 18.12 th 2.94 

MSE(z_RS)        2.09        0.45

MSE(z_RP)        1.83        0.33

R.E.             114.21      136.36

R.B.(z_RS)       -.0105      .0348

R.B.(z_RP)       -.0047


Numerical Illustration. Here the two proposed sampling
strategies are applied to two small populations to shed light
on the computations of $F_{ij}$, $z_{ij}$ and $w_{ij}$, and on their
comparison. For both the populations $K = 3$, $k = 2$,
$M = 12$ and $N = 9$. A unit repeated in two or more
clusters represents overlapping. The populations are
described in Table 1.


The analysis of the results in Table 1 supports the
theoretical developments of the present paper. For both
the populations, the factor $F = N_i[(\bar{Z}_i - \bar{Y}\bar{W}_i)^2 +
(1/n_i - 1/N_i)D_i^2]$ increases with $N_i$, resulting in
$MSE(\bar{z}_{RP}) < MSE(\bar{z}_{RS})$, as remarked above.
CONCLUSION


This paper removes the unrealistic assumption of known
population size made in the earlier work of Singh (1988) on
overlapping clusters. Also, the comparison of the
two strategies here is more direct, whereas in Singh (1988)
the supporting evidence given by Hansen and Hurwitz
(1943) was needed.
PPS Sampling over Two Occasions


N.G.N. PRASAD and J.E. GRAHAM¹


ABSTRACT


The Random Group Method for sampling with probability proportional to size (PPS) is extended to sampling over
two occasions. Information on a study variate observed on the first occasion is used to select the matched portion
of the sample on the second occasion. Two real data sets are considered for numerical illustration and for comparison
with other existing methods.


KEY WORDS: Composite estimator; Efficiency comparisons; Random group method; Probability proportional
to size.


1. INTRODUCTION 


The practice of using a partial replacement sampling
scheme in repeated surveys is quite common due, in part,
to an anticipated increase in the efficiency of estimation
as well as a reduction in the burden of response.
Essentially, after each sampling occasion a fraction of the units
observed on that occasion is rotated out of the sample and
replaced by a fresh sub-sample from the population. This
set of unmatched units is then observed on the next
sampling occasion along with the remaining set of matched
units. The literature abounds with discussions of sampling
and estimation procedures for sampling with equal selection
probabilities on two occasions. A particularly important
case is the situation where the units are chosen on a given
occasion with unequal selection probabilities. In the
literature to date, information collected on the previous occasion
is used to improve upon the customary estimator of the
total or mean for the current occasion by using a difference 
method of estimation. In this article we present a sampling 
and estimation procedure for sampling on two occasions 
which incorporates information collected on the first 
(previous) occasion in selecting the sub-sample for obser- 
vation on the second (current) occasion. For the sake of 
completeness and parsimony, we review only unequal 
probability selection procedures for two occasions in this 
section. 

Consider a finite population of $N$ units, labelled
$1, 2, \ldots, N$, and two sampling occasions: 1 (previous
occasion) and 2 (current occasion). Let $y_{1i}$ and $y_{2i}$ denote
the values of a characteristic $y$ for the $i$-th unit observed
on the first and second occasions respectively. Let $Y_1$ and
$Y_2$ denote the respective population totals. Suppose a size
measure $x$ is known for each of the population units.




1.1 The Des Raj Scheme 


Raj (1965) considered the following PPS (probability
proportional to size) sampling scheme: On the first
occasion a sample $s$ of size $n$ is selected with probabilities $p_i$
proportional to the $x_i$ values, $i = 1, 2, \ldots, N$, and with
replacement (wr). On the second occasion a simple random
sample $s_1$ of $m$ units is selected from $s$ without replacement
(wor) and an independent PPS sample $s_2$ of $u = n - m$
units is selected wr from the entire population. Then $Y_1$
and $Y_2$ are respectively unbiasedly estimated by:

$$\hat{Y}_1 = \sum_{i \in s} y_{1i}/(np_i) \tag{1.1}$$

and

$$\hat{Y}_2 = Q\hat{Y}_{2u} + (1 - Q)\hat{Y}_{2m}, \tag{1.2}$$

where

$$\hat{Y}_{2u} = \sum_{i \in s_2} y_{2i}/(up_i), \tag{1.3}$$

$$\hat{Y}_{2m} = \sum_{i \in s} y_{1i}/(np_i) + \sum_{i \in s_1} (y_{2i} - y_{1i})/(mp_i), \tag{1.4}$$

and $Q$ is a weight, $0 \le Q \le 1$. Assuming that

$$V_1 = \sum_{i=1}^{N} (y_{1i}/p_i - Y_1)^2 p_i = V_2 = \sum_{i=1}^{N} (y_{2i}/p_i - Y_2)^2 p_i = V, \tag{1.5}$$

the minimum variance of $\hat{Y}_2$ was found to be

$$V_{min}(\hat{Y}_2) = V\left[1 + \sqrt{2(1 - \delta)}\right]/(2n), \tag{1.6}$$

where $\delta$ is given by

$$V\delta = \sum_{i=1}^{N} (y_{1i}/p_i - Y_1)(y_{2i}/p_i - Y_2)\, p_i. \tag{1.7}$$

1.2 The Ghangurde-Rao (G-R) Scheme 


Under the PPSWOR framework, Ghangurde and Rao
(1969) extended the Rao-Hartley-Cochran (RHC) Method,
also known as the Random Group Method (see Rao,
Hartley and Cochran 1962), to sampling on two occasions.
Under the RHC Method, the population of $N$ units is split
at random into $n$ groups of sizes $N_1, N_2, \ldots, N_n$ such
that $\sum_{g=1}^{n} N_g = N$, and a sample of one unit is drawn
independently from each of the $n$ groups with probabilities
proportional to the initial selection probabilities, $p_i$.
Under the G-R Method, the population is first divided at
random into $n$ groups, each of size $N/n$ (assumed to be
an integer). On the first occasion, one unit is drawn from
each random group (as described above), giving a sample
$s$ of $n$ units. On the second occasion, a simple random
wor sample $s_1$ of $m = \lambda n$ $(0 < \lambda < 1)$ matched units is
selected from $s$ and an independent sample $s_2$ of $u = n - m$
units is drawn from the whole population of $N$ units by
the same method that was used in obtaining $s$. Then, a
composite estimator of $Y_2$ is given by


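$$\hat Y_2' = Q' \hat Y_{2u}' + (1 - Q') \hat Y_{2m}', \tag{1.8}$$
where $0 < Q' < 1$,
$$\hat Y_{2u}' = \sum_{i \in s_2} y_{2i} P_i^{*}/p_i, \tag{1.9}$$
and
$$\hat Y_{2m}' = \sum_{i \in s} y_{1i} P_i/p_i + n m^{-1} \sum_{i \in s_1} (y_{2i} - y_{1i}) P_i/p_i, \tag{1.10}$$
with $P_i$ and $P_i^{*}$ denoting the totals of the $p_i$ values for the
groups containing the $i$-th unit $(i = 1, 2, \ldots, N)$ in the
selection of $s$ and $s_2$ respectively. Under assumption (1.5),
the variance of $\hat Y_2'$ (with optimum values of $Q'$ and $\lambda$) is
given by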


p NV 
V, CO) = Se 
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where 


and 


N 
IN ye (ni — Yi) 2x - Y2)/V’. 


i=1


1.3 The Chotai Scheme


Chotai (1974), under the additional assumption that
$n/m$ is an integer, modified the G-R sampling design on
the second occasion. A sample $s$ is selected as in the G-R
scheme on the first occasion. On the second occasion, the
$n$ units in the sample $s$ are split at random into $m$ $(= \lambda n)$
groups each of size $n/m$. One unit is selected from each
of the $m$ groups independently with probabilities proportional
to the $P_i$'s as defined in the G-R scheme. This selection
yields the sample $s_1$. The selection of $s_2$ is the same
as in the G-R scheme. Then a composite estimator of $Y_2$
is given by


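$$\hat Y_2^{C} = Q^{C} \hat Y_{2u}^{C} + (1 - Q^{C}) \hat Y_{2m}^{C}, \tag{1.12}$$
where $0 < Q^{C} < 1$,
$$\hat Y_{2u}^{C} = \sum_{i \in s_2} y_{2i} P_i^{*}/p_i \tag{1.13}$$
and
$$\hat Y_{2m}^{C} = \sum_{i \in s_1} (y_{2i} - y_{1i}) P_i'/p_i + \sum_{i \in s} y_{1i} P_i/p_i. \tag{1.14}$$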

Here, $P_i$ and $P_i^{*}$ are as defined in the G-R scheme, and
$P_i'$ denotes the total of the $P_i$-values for those random
groups of $s$ containing the $i$-th unit $(i = 1, 2, \ldots, N)$ in
the selection of $s_1$. The minimum variance of $\hat Y_2^{C}$ under
assumption (1.5), obtained by using the optimum values
of $Q^{C}$ and $\lambda$, is given by
$$V_{\min}(\hat Y_2^{C}) = \frac{N V_0}{2n(N - 1)} \left[(1 - n/N) + \sqrt{2(1 - \delta)}\right]. \tag{1.15}$$
Under this scheme, but without assumption (1.5),
Chotai also considered an estimator of $Y_2$ (similar to
Kulldorff's estimator for simple random sampling; see
Kulldorff 1963), given by


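$$\hat Y_2^{CK} = Q^{CK} \hat Y_{2u}^{C} + (1 - Q^{CK}) \hat Y_{2m}^{CK}, \tag{1.16}$$
where $\hat Y_{2u}^{C}$ is as defined in (1.13), $Q^{CK}$ $(0 < Q^{CK} < 1)$
is an assigned weight to be determined and
$$\hat Y_{2m}^{CK} = \sum_{i \in s_1} (y_{2i} - \beta y_{1i}) P_i'/p_i + \beta \sum_{i \in s} y_{1i} P_i/p_i, \tag{1.17}$$
with
$$\beta = \delta \left[ \frac{\sum_{i=1}^{N} p_i (y_{2i}/p_i - Y_2)^2}{\sum_{i=1}^{N} p_i (y_{1i}/p_i - Y_1)^2} \right]^{1/2} \tag{1.18}$$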

and $\delta$ as defined in (1.7). The minimum variance of $\hat Y_2^{CK}$,
using optimum values of $Q^{CK}$ and $\lambda$, is given by
$$V_{\min}(\hat Y_2^{CK}) = \frac{N V_2}{2n(N - 1)} \left(1 + \sqrt{1 - \delta^2} - n/N\right). \tag{1.19}$$


To actually use $\hat Y_2^{CK}$ it is evidently necessary to first
assess the value of $\beta$, which is usually not possible in practice.
An estimate of $\beta$ based on the available sample can
be used but this will induce a bias in the estimator $\hat Y_2^{CK}$.


2. ALTERNATIVE SCHEMES FOR SAMPLING 
PPS OVER TWO OCCASIONS 


We now present an alternative sampling and estimation
procedure which does not require a known value of $\beta$ as
defined in (1.18). In this scheme information collected on
the first occasion is used in selecting the sample on the 
second occasion. The approach is based upon a procedure 
developed by Prasad and Srivenkataramana (1980) and was 
used there in the context of double sampling where a second 
phase sub-sample is selected using information obtained 
from an initial sample. For simplicity, we first consider its 
implementation in Raj’s (1965) scheme (described earlier). 


2.1 A Modification of Des Raj’s Scheme 


On the first occasion a sample $s$ of size $n$ is selected with
probabilities $p_i$ proportional to the $x_i$ values and with
replacement. On the second occasion, instead of choosing
an SRSWOR sub-sample, a sub-sample $s_1$ of $m$ units is
selected from $s$ using a PPSWR scheme with size measure
$z_i = y_{1i}/x_i$, where $y_{1i}$ is the observed value for the $y$
characteristic for unit $i$ on the first occasion. A sample $s_2$
of size $u = n - m$ is drawn, independent of $s$, as in Raj
(1965). A composite estimator of $Y_2$ is given by
$$\hat Y_2 = Q \hat Y_{2u} + (1 - Q) \hat Y_{2m},$$
where $\hat Y_{2u}$ is as defined in (1.3) and
$$\hat Y_{2m} = \left[ \frac{1}{m} \sum_{i \in s_1} \frac{y_{2i}/p_i}{y_{1i}/p_i} \right] \cdot \frac{1}{n} \sum_{i \in s} (y_{1i}/p_i),$$
with $Q$ being a weight, $0 < Q < 1$. The minimum
variance of $\hat Y_2$, obtained by minimizing the variance of $\hat Y_2$
with respect to $Q$, is given by


Vinin(Y2) = ViC\(n + Cym)7, 


where Cy = Yq (yxi/pu — Y2)*puiVi', with py; = 
yi;/Y, and V, as defined in (1.5). 
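
The unbiasedness of the matched estimator of Section 2.1 can be checked exactly on a small example. The following Python sketch is illustrative only: the function names and the toy data are assumptions, the size measure is written as $y_{1i}/p_i$ (equivalent to $y_{1i}/x_i$ up to a constant, since $p_i$ is proportional to $x_i$), and the expectation is taken over a single PPS draw ($m = 1$) from a fixed first-occasion sample.

```python
from fractions import Fraction as F


def hh_estimate(values, p, sample):
    """PPS with-replacement estimate of a total: mean of values[i] / p[i] over the sample."""
    return sum(values[i] / p[i] for i in sample) / len(sample)


def matched_estimate(y1, y2, p, s, s1):
    """Mean of (y2/p)/(y1/p) over the sub-sample s1, times the estimate of Y1 from s."""
    mean_ratio = sum((y2[i] / p[i]) / (y1[i] / p[i]) for i in s1) / len(s1)
    return mean_ratio * hh_estimate(y1, p, s)


def expected_matched(y1, y2, p, s):
    """Exact expectation over one PPS draw from s with size measure z_i = y1_i / p_i."""
    z = [y1[i] / p[i] for i in s]
    total = sum(z)
    return sum((zi / total) * matched_estimate(y1, y2, p, s, [i])
               for zi, i in zip(z, s))


if __name__ == "__main__":
    # Hypothetical population of four units and a with-replacement sample s.
    p = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
    y1 = [F(3), F(5), F(8), F(12)]
    y2 = [F(4), F(4), F(9), F(15)]
    s = [0, 2, 2, 3]
    print(expected_matched(y1, y2, p, s), hh_estimate(y2, p, s))
```

Given $s$, the conditional expectation of the matched estimator coincides with the usual with-replacement PPS estimate of $Y_2$ computed from $s$, which is itself unbiased for $Y_2$.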


2.2 A Modification to Chotai’s Scheme 


As in Chotai (1974), assume that $N$, $n$, and $m$ $(< n)$
are all positive integers such that $N/n$, $N/u$ and $n/m$ are
also all integers. Then:


1. For the first occasion select a sample $s$ of size $n$ in the
same manner as that adopted in the G-R procedure.
For this set of units, observations $y_{1i}$, $i = 1, \ldots, n$,
are made on a characteristic $y$.


2. For the second occasion, (a) split the $n$ units in $s$ at
random into $m$ groups, each of size $n/m$, and draw one
unit with PPS, $p_i^{*} = (y_{1i} P_i)/p_i$, independently from
each of the $m$ groups, yielding a sub-sample $s_1$, where
$P_i$ is as defined in the G-R scheme; (b) select $s_2$, a fresh
sample of $u = n - m$ units from the entire population,
in the same manner as in the G-R scheme, and observe
the second occasion $y$ values, $y_{2i}$, for these $u$ units.


Note that the difference between the proposed procedure
and that of Chotai (1974) lies in the selection of $s_1$:
in the former, information collected on the first occasion
is used in selecting $s_1$ on the second occasion.
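
The random group (RHC) selection that underlies the first-occasion sample in the G-R, Chotai and proposed schemes can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are assumptions, the groups are taken to be of equal size $N/n$, and the estimator shown is the basic RHC estimator $\sum y_i P_i/p_i$ for a single occasion.

```python
import random


def rhc_sample(p, n, rng):
    """Draw one unit from each of n equal-sized random groups (RHC method).

    p   : initial selection probabilities p_i (summing to 1)
    n   : number of random groups; len(p) is assumed divisible by n
    rng : a random.Random instance
    Returns a list of (unit index, group total P) pairs.
    """
    N = len(p)
    units = list(range(N))
    rng.shuffle(units)                      # random split of the population
    size = N // n
    sample = []
    for g in range(n):
        group = units[g * size:(g + 1) * size]
        group_total = sum(p[i] for i in group)
        # One unit per group, selected with probability p_i / group_total.
        chosen = rng.choices(group, weights=[p[i] for i in group], k=1)[0]
        sample.append((chosen, group_total))
    return sample


def rhc_estimate(y, p, sample):
    """RHC estimate of the population total: sum of y_i * P / p_i over the sample."""
    return sum(y[i] * group_total / p[i] for i, group_total in sample)


if __name__ == "__main__":
    p = [0.10, 0.20, 0.30, 0.15, 0.05, 0.20]   # hypothetical size-based probabilities
    y = [10 * pi for pi in p]                   # y exactly proportional to p
    print(rhc_estimate(y, p, rhc_sample(p, 3, random.Random(1))))
```

When $y_i$ is exactly proportional to $p_i$, every possible sample reproduces the population total, which is the property that makes a well-chosen size measure valuable.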

We now consider an estimator of the second occasion
total $Y_2$ that exploits the proposed procedure. Let


Seek; 
Jape = : 
i 


A composite estimator of $Y_2$ is given by
$$\hat Y_2^{*} = Q^{**} \hat Y_{2u}^{C} + (1 - Q^{**}) \hat Y_{2m}^{*}, \tag{2.1}$$
where $\hat Y_{2u}^{C}$ is defined as in (1.13), $0 < Q^{**} < 1$ and


i€S} Pi 


Here P, denotes the total of the p* values associated 
with those units that belong to the random group from 
which the $i$-th unit was selected in $s_1$. Let $E_1$ and $E_2$ denote
expectation and $V_1$ and $V_2$ denote variance over all $s$ and
for a given $s$, respectively. The unbiasedness of $\hat Y_{2m}^{*}$, and
hence of $\hat Y_2^{*}$ for $Y_2$, follows by noting that the expected
value of $\hat Y_{2m}^{*}$ is
$$E(\hat Y_{2m}^{*}) = E_1 E_2(\hat Y_{2m}^{*}) = E_1\left( \sum_{i \in s} \frac{y_{2i} P_i}{p_i} \right) = Y_2. \tag{2.2}$$
To obtain the variance of $\hat Y_{2m}^{*}$, consider
which leads, after considerable algebraic simplification, to
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N 


agiOny i m 
bie? 03 = V,= Y) (yxi/pi — Yo) pj and A= i 
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Because YS, and Y%,, are independent, the variance of Y¥ 
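is given by
$$V(\hat Y_2^{*}) = Q^{**2}\, V(\hat Y_{2u}^{C}) + (1 - Q^{**})^2\, V(\hat Y_{2m}^{*}),$$
where
$$V(\hat Y_{2u}^{C}) = \frac{N - u}{u(N - 1)}\, \sigma_2^2$$
and $V(\hat Y_{2m}^{*})$ is given by (2.3).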


The minimum of $V(\hat Y_2^{*})$ is obtained by using
optimum values of $Q^{**}$ and $\lambda$, respectively given by


(= nny +5; 
ins Vex PITS Ry nENY 
(1 ami N sa oe Ree 
Chi 2) 
and 
ie vh 
. Te Ne 


Hence, the minimum of $V(\hat Y_2^{*})$ is given by
$$V_{\min}(\hat Y_2^{*}) = \frac{N \sigma_2^2}{2n(N - 1)} \left[(1 - n/N) + \sqrt{h}\right]. \tag{2.4}$$
Note that the quantity $h$ reflects the efficiency of the
estimator using the $p_i$'s as initial selection probabilities
over the estimator with initial selection probabilities
$y_{1i}/Y_1$. A "small" value of $h$ leads to an increase in the
efficiency of the proposed method over Chotai's.


3. NUMERICAL EFFICIENCY COMPARISONS 


The composite estimators $\hat Y_2^{C}$ defined in (1.12), $\hat Y_2^{CK}$
defined in (1.16) and $\hat Y_2^{*}$ defined in (2.1) are now compared
at their respective optimum $Q$ and $\lambda$ values. The efficiency
of the scheme proposed in Section 2.2 relative to Chotai's (1974)
procedure is examined through a comparison of the
following two relative efficiencies:
$$RE1 = \frac{V_{\min}(\hat Y_2^{C})}{V_{\min}(\hat Y_2^{*})} = \frac{(1 - n/N) + \sqrt{2(1 - \delta)}}{(1 - n/N) + \sqrt{h}}$$
and
$$RE2 = \frac{V_{\min}(\hat Y_2^{CK})}{V_{\min}(\hat Y_2^{*})} = \frac{(1 - n/N) + \sqrt{1 - \delta^2}}{(1 - n/N) + \sqrt{h}},$$
obtained respectively using (1.15) and (2.4), and
(1.19) and (2.4). It follows that the proposed scheme is
superior to that of Chotai using Kulldorff's estimator
(which depends on the unknown constant $\beta$) for those
populations having $h < (1 - \delta^2)$. In order to permit
meaningful numerical comparisons, two data sets that
have appeared elsewhere in the literature are used here.


Data Set A: This data set relates to the area under wheat
in 1964 ($y_2$), in 1963 ($y_1$) and cultivated area in 1961 ($x$)
for 34 villages in India (see Murthy 1967). The parameter
values for this data set are $\delta = 0.6404$ and $h = 0.1868$.


Data Set B: This data set relates to the area under wheat
in 1937 ($y_2$) and in 1936 ($y_1$) and cultivated area in 1930
($x$) for a sample of 34 villages in India (see Sukhatme, P.V.
and Sukhatme, B.V. 1970). The corresponding parameter
values for this data set are $\delta = 0.7635$ and $h = 0.3811$.


Using these values for $\delta$ and $h$ the two relative efficiency
values RE1 and RE2 (expressed as percentages)
were computed for selected values of $n/N$ and are given
in Tables 1 and 2.


Table 1
RE1% - Values for Data Sets A and B

n/N     Data Set A   Data Set B
0.05      130.09       124.30
0.10      131.22       125.21
0.15      132.43       126.19
0.20      133.75       127.25
0.25      135.18       128.41
0.30      136.73       129.66


Table 2
RE2% - Values for Data Sets A and B

n/N     Data Set A   Data Set B
0.05      104.49       101.82
0.10      104.64       101.88
0.15      104.80       101.94
0.20      104.97       102.01
0.25      105.15       102.08
0.30      105.34       102.16
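
The two relative efficiencies depend only on the sampling fraction $n/N$ and on the parameters $\delta$ and $h$, so they are easy to recompute. The Python sketch below is an illustration under stated assumptions: the function and variable names are hypothetical, and the formulas are the ratios of minimum variances with the bracketed terms $(1 - n/N) + \sqrt{2(1-\delta)}$, $(1 - n/N) + \sqrt{1-\delta^2}$ and $(1 - n/N) + \sqrt{h}$.

```python
from math import sqrt


def re1(f, delta, h):
    """RE1 (%): Chotai's composite estimator relative to the proposed one; f = n/N."""
    return 100 * ((1 - f) + sqrt(2 * (1 - delta))) / ((1 - f) + sqrt(h))


def re2(f, delta, h):
    """RE2 (%): Chotai's scheme with Kulldorff's estimator relative to the proposed one."""
    return 100 * ((1 - f) + sqrt(1 - delta ** 2)) / ((1 - f) + sqrt(h))


# Parameter values (delta, h) quoted in the text for the two data sets.
DATA = {"A": (0.6404, 0.1868), "B": (0.7635, 0.3811)}


def efficiency_table(fractions=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)):
    """Rows of (n/N, data set, RE1, RE2), rounded to two decimals."""
    rows = []
    for f in fractions:
        for name, (delta, h) in DATA.items():
            rows.append((f, name, round(re1(f, delta, h), 2), round(re2(f, delta, h), 2)))
    return rows


if __name__ == "__main__":
    for row in efficiency_table():
        print(row)
```

Both ratios exceed 100% whenever $h$ is smaller than $2(1 - \delta)$ and $1 - \delta^2$ respectively. Comparing the printed rows with the tabulated figures is a useful check on how the rows and columns of the tables are labelled.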


An examination of Table 1 leads to the conclusion that
the proposed scheme outperforms that of Chotai (1974).
The gain in efficiency ranges from 30% to 37% for
Data Set A and from 24% to 30% for Data Set B as the
sampling fraction varies from 0.05 to 0.30. Note that the
increase in efficiency is greater for Data Set A than for
Data Set B because of the difference in the value of the
parameters $h$ (0.1868 vs. 0.3811) and of $\delta$ (0.6404 vs.
0.7635). Recall that $h$ measures the efficiency of $p_i$ as a
size measure for unit $i$ compared to the use of $y_{1i}$ as a size
measure in estimating the total $Y_2$ for the current occasion
and $\delta$ is the correlation between $y_{1i}/p_i$ and $y_{2i}/p_i$ as
defined in (1.7). When $h$ is relatively small, greater gains
in efficiency are realized with the proposed scheme than
when $h$ is not small. In both cases, however, the efficiency
gains using the proposed procedure are worthwhile.


The efficiency gains using the proposed method compared
to the use of Chotai's scheme with Kulldorff's
estimator (as reported in Table 2) are minimal, varying
from 4.5% to 5.3% for Data Set A and from 1.8% to
2.2% for Data Set B. But in order to use Kulldorff's
estimator, the value of $\beta$ must be available. In practice
this is not the case. It follows that the proposed strategy
performs well from the point of view of actual implementation
and of efficiency gain.


There are situations where the auxiliary information
needed to compute the initial selection probabilities is not
available. A simple random sampling scheme may then be
used in place of the RHC procedure in selecting the sample
for the first occasion enumeration; the RHC procedure
can then be adopted in selecting $s_1$ by using the SRS information
on the study variable collected on the first occasion.
The theory for such a procedure follows directly as a
special case of that presented by taking $p_i = 1/N$, $i = 1,
\ldots, N$. One would anticipate that substantial gains in
efficiency would then result in this situation.


Multi-way Stratification by Linear Programming


R.R. SITTER and C.J. SKINNER¹


ABSTRACT 


Rao and Nigam (1990, 1992) showed how a class of controlled sampling designs can be implemented using linear 
programming. In this article their approach is applied to multi-way stratification. A comparison is made with 
existing methods both by illustrating the sampling schemes generated for specific examples and by evaluating mean 
squared errors. The proposed approach is relatively simple to use and appears to have reasonable mean squared 
error properties. The computations required can, however, increase rapidly as the number of cells in the multi-way
classification increases. Variance estimation is also considered.


KEY WORDS: Controlled selection; Linear programming; Multistage sampling; Stratified sampling. 


1. INTRODUCTION 


There are often several stratifying variables available 
to the sample designer and it is natural in such cases for 
the designer to consider defining strata as the cells formed 
by cross-classifying categories of these variables. A problem 
with this approach, particularly common when selecting 
primary sampling units (psu’s) in household surveys, is 
that the desired sample size may be less than the total 
number of cells and hence conventional methods of 
stratification may be inapplicable. 

An illustration, based on a hypothetical example of
Bryant et al. (1960), is given in Table 1. Communities
(psu's) are classified by two stratifying factors: type of
community with three categories and region with five
categories. The desired sample size of n = 10 is less than
the total number of cells, 15. This example also illustrates
a related problem. The entries in Table 1 are the expected
counts under proportionate stratification, that is the
population proportions multiplied by the sample size.
Even if the sample size were doubled to exceed the number
of cells, the expected sample counts would still not be
integers. Whilst the effect of rounding such values to
integers may not be practically significant for large
expected counts, the choice of how to round with very
small expected counts may be of greater concern.

One reaction to the problem of many cells is simply to 
drop one or more of the stratifying variables or to group 
some of the categories. Alternatively, a number of proce- 
dures have been proposed which attempt to retain some 
‘control’ for all the categories of all the stratifying variables 
by permitting different forms of random selection of cells. 

Goodman and Kish (1950) proposed one procedure 
under the title ‘controlled selection’. Jessen (1970) suggests 
that ‘this method is somewhat complicated and its use in 
applied sampling appears limited’ (p. 778). Waterton (1983) 


Table 1

Expected Sample Cell Counts Under Proportionate
Stratification with n = 10

                 Type of Community
Regions    Urban   Rural   Metropolitan   Total
   1        1.0     0.5        0.5         2.0
   2        0.2     0.3        0.5         1.0
   3        0.2     0.6        1.2         2.0
   4        0.6     1.8        0.6         3.0
   5        1.0     0.8        0.2         2.0

Total       3.0     4.0        3.0        10.0


illustrates this complexity. Bryant et al. (1960) propose a
much simpler method for two-way stratification. Their
method has the property that the expected sample counts
display independence between the rows and columns of
the two-way table. If the rows and columns are also independent
in the population then there is no problem but if,
as will often be the case, there is an appreciable lack of
independence then some reweighting will usually be necessary
and this can be unattractive in practice and can inflate
the variance as is shown in Section 5. Jessen (1970) points
out that a further limitation of the method of Bryant et al.
(1960) is that it is not possible to constrain specified cell
sizes to be zero. He proposes two approaches for both
two-way and three-way stratification but both approaches
remain fairly complicated to implement and, as noted by
Causey et al. (1985), do not always lead to a solution.
All the above methods may be carried out by hand with
varying degrees of laboriousness, but none takes advantage
of the power of modern computing. In this paper we shall
show how computational procedures of linear programming
can be applied to the multi-way stratification problem
following Rao and Nigam (1990, 1992). Our proposed
approach may be viewed as complementing the linear
programming approach proposed by Causey et al. (1985).
Which of the two approaches is preferable will depend on
the nature of the stratification problem and on the software
available. The potential disadvantage of our approach
is that it can be much more computationally intensive,
since the number of unknowns in our linear programming
problem may be as large as $\binom{k}{n}$, where $k$ is the number of
cells in the table and $n$ is the sample size, whereas the
number of unknowns in the approach of Causey et al.
(1985) is only $k$. A number of suggestions will be made,
however, to reduce the computational demands of our
approach. There are several potential advantages of our
approach. First, the stratification problem corresponds
directly to the linear programming problem and so the
computer programming is straightforward, whereas the
approach of Causey et al. is less direct, involving mimicking
the behaviour of nonlinear functions by linear functions
(p. 904) and nesting repeated linear programming problems
within a further recursive algorithm. Second, our procedure
always has a solution, whereas the procedure of
Causey et al. need not, for example in cases of three-way
stratification. Third, the objective function in our linear 
programming problem can be naturally modified to reflect 
the different objectives of the stratification problem, for 
example in a three-way problem where it is more important 
to ‘balance’ the sample with respect to the first two strat- 
ifying variables than the third. Fourth, our procedure can 
be naturally modified to constrain the joint inclusion 
probabilities of cells to be positive in order to permit 
unbiased variance estimation. 


2. THE PROPOSED APPROACH 


2.1 Basic Ideas 


We begin with the simplest kind of two-way stratification.
Let a population of $N$ units be classified into the $RC$
cells of a two-way table formed by cross-classifying a row
stratification factor with $R$ categories and a column factor
with $C$ categories. Let $N_{ij}$ be the number of units in cell
$ij$, that is the set of units in both row $i$ and column $j$, and
let $P_{ij} = N_{ij}/N$ be the corresponding proportion. The
parameter of interest is taken to be the population mean,
$\bar Y$, of a variable $Y$.

Consider the following two-stage sampling procedure.
First, sample sizes $n_{ij}$ are determined for each cell according
to a specified randomized procedure. Letting $s$ denote
the $R \times C$ array $(n_{ij};\ i = 1, \ldots, R;\ j = 1, \ldots, C)$, this
procedure assigns a probability $p(s)$ to each $s$ in a set $S$
of possible arrays. To emphasize the dependence of $n_{ij}$ on
$s$ we write $n_{ij}(s)$. Second, a simple random sample of
$n_{ij}(s)$ units is selected from cell $ij$ and the values of $Y$ are
recorded for the sample units.


We restrict attention to designs of fixed sample size
$n > 0$, that is we restrict $S$ to be the set $S_n$ of all arrays
such that
$$\sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}(s) = n.$$
We also restrict attention to proportionate stratification
so that
$$\sum_{s \in S_n} n_{ij}(s)\, p(s) = n P_{ij} \quad \text{for } i = 1, \ldots, R,\ j = 1, \ldots, C. \tag{2.1}$$
It follows from (2.1) that the simple unweighted sample
mean $\bar y(s)$ is an unbiased estimator of $\bar Y$. We propose to
choose a (or the) sampling design $p(s)$ which minimizes
the expected lack of 'desirability' of the sample $s$ by solving
the problem:
$$\underset{p \in P}{\text{minimize}} \ \sum_{s \in S_n} w(s)\, p(s), \tag{2.2}$$
subject to the constraint (2.1), where $w(s)$ is a loss function
for the sample $s$ to be specified and $P$ is the class of
possible sample designs on $S_n$ obeying
$$0 \le p(s) \le 1 \quad \text{for all } s \in S_n. \tag{2.3}$$
Note that (2.1) implies $\sum_{s \in S_n} p(s) = 1$. The key observation
of Rao and Nigam (1990, 1992) is that the objective
function in (2.2) and the equality and inequality constraints
in (2.1) and (2.3) are all linear in $p(s)$ and hence this
problem may be solved directly by linear programming
with the $p(s)$, $s \in S_n$, as unknowns. The main obstacle to
this approach is that the number of elements in $S_n$ is often
very large and even with modern computing power it
becomes difficult to carry out linear programming if the
number of unknowns is large.

It is therefore desirable to restrict attention to a subset
of $S_n$. One natural restriction is to consider only arrays
$s$ for which $n_{ij}(s)$ is either equal to $I_{ij} = [nP_{ij}]$, the
greatest integer not exceeding $nP_{ij}$, or $I_{ij} + 1$. Letting
$\tilde n_{ij}(s) = n_{ij}(s) - I_{ij}$ and $r_{ij} = nP_{ij} - I_{ij}$ the problem becomes
$$\underset{p}{\text{minimize}} \ \sum_{s \in S_{\tilde n}} w(s)\, p(s), \tag{2.4}$$
subject to
$$\sum_{s \in S_{\tilde n}} \tilde n_{ij}(s)\, p(s) = r_{ij}, \tag{2.5}$$
$$\sum_{s \in S_{\tilde n}} p(s) = 1, \quad 0 \le p(s) \le 1 \quad \text{for all } s \in S_{\tilde n}, \tag{2.6}$$
where $S_{\tilde n}$ is the set of $R \times C$ arrays in which all elements
are 0 or 1 and the sum of elements is $\tilde n = n - \sum_{ij} I_{ij}$.
Note, of course, that if all the $I_{ij}$ are zero, then this is just
the same problem as before. The number of elements in
$S_{\tilde n}$, which determines the magnitude of the computational
task for linear programming, is now $\binom{RC}{\tilde n}$. This number
can still be very large, however, and some further reduction
can be achieved by sensible choice of the loss function
$w(s)$ as discussed in the next section.

For Table 1, this would amount to considering the
situation represented by Table 2, while only allowing a 0
or 1 cell sample size, and then adding back 1 to cells (1,1),
(3,3), (4,2) and (5,1) in the final solution. Thus $n = 10$,
$\tilde n = 6$.


Table 2
Table of $r_{ij}$ Values from Table 1 with $\tilde n = 6$

                 Type of Community
Regions    Urban   Rural   Metropolitan   Total
   1        0.0     0.5        0.5         1.0
   2        0.2     0.3        0.5         1.0
   3        0.2     0.6        0.2         1.0
   4        0.6     0.8        0.6         2.0
   5        0.0     0.8        0.2         1.0
Total       1.0     3.0        2.0         6.0


2.2 Choice of Loss Function w(s) 


The major flexibility of the proposed approach derives
from the user's freedom to choose the function $w(s)$
which enters the objective function in (2.2). The conventional
approach to two-way stratification (e.g., Jessen
1970; Causey et al. 1985) is to require that the selected
sample $s$ obey the marginal constraints:
$$|n_{i\cdot}(s) - nP_{i\cdot}| < 1, \quad i = 1, \ldots, R; \tag{2.7}$$
$$|n_{\cdot j}(s) - nP_{\cdot j}| < 1, \quad j = 1, \ldots, C, \tag{2.8}$$
where
$$n_{i\cdot}(s) = \sum_{j} n_{ij}(s), \qquad n_{\cdot j}(s) = \sum_{i} n_{ij}(s).$$
This requirement can be accommodated in our approach
by setting $w(s)$ as (effectively) infinite for samples $s$ not
satisfying (2.7) or (2.8) or more simply by excluding such
samples from the set $S_n$. The problem with this conventional
approach is that no solution to the constrained
optimization problem may exist.

In our approach, however, if we use a loss function
such as
$$w(s) = \sum_{i=1}^{R} (n_{i\cdot}(s) - nP_{i\cdot})^2 + \sum_{j=1}^{C} (n_{\cdot j}(s) - nP_{\cdot j})^2, \tag{2.9}$$
then an optimal solution will always exist within a large
enough set $S_n$. In practice, it may be advantageous computationally
to restrict the set $S_n$ initially to only those
samples obeying (2.7) and (2.8), or even a subset of these,
and then to expand the set if necessary, say by changing
1 to 2 in (2.7) and (2.8), until a solution is found.

Let us now consider the more fundamental question of 
why constraints such as (2.7) and (2.8) are sensible 
anyway. From a non-statistical point of view, the balancing 
of a sample with respect to factors with a known population 
distribution may reassure users about the ‘representa- 
tiveness’ of the sample. From a statistical point of view, 
given our unbiasedness constraint (2.1), it is natural to 
consider how the loss function might be chosen to improve 
efficiency. This question may be examined by taking $w(s)$
as the mean squared error $E_m(\bar y(s) - \bar Y)^2$ under a model
$m$. Then the solution to the optimization problem (2.2)
minimizes the design-expected model-mean squared error 
or equivalently, since we require design-unbiasedness, the 
model-expected design variance. 


Consider, for example, the main-effects analysis of
variance model
$$y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk},$$
where $y_{ijk}$ is the $k$th value of $Y$ in cell $ij$, $\mu$ is a fixed mean
and $\alpha_i$, $\beta_j$ and $\epsilon_{ijk}$ are independent zero-mean random
effects with variances $\sigma_\alpha^2$, $\sigma_\beta^2$ and $\sigma_\epsilon^2$, respectively. Then,
ignoring finite population correction terms,
$$E_m(\bar y(s) - \bar Y)^2 = \sigma_\alpha^2 \sum_{i} (n_{i\cdot}(s)/n - P_{i\cdot})^2 + \sigma_\beta^2 \sum_{j} (n_{\cdot j}(s)/n - P_{\cdot j})^2 + \sigma_\epsilon^2/n. \tag{2.10}$$
Hence, if $\sigma_\alpha^2 = \sigma_\beta^2$ the expected design variance of $\bar y(s)$
under this model is minimized by taking the loss function
in (2.9). Alternatively, if one had some prior information
about the likely ratio of the between row variance relative
to the between column variance then it may be sensible,
on efficiency grounds, to modify the loss function in (2.9)
by multiplying the first term on the right hand side of (2.9)
by this estimated ratio.

On the other hand, if it is thought a priori that there is
likely to be a strong interaction between the row and
column factors in their effect on Y then simply attempting
to balance on the margins may be inappropriate. For
example, if one stratification factor is urban/rural and the
other is an economic indicator X and it is known that Y
is positively related to X in urban areas and negatively
related in rural areas then it is likely to be more efficient
to stratify partially by X separately within rural and urban
areas than to balance fully on both margins. See Bryant
et al. (1960, section 9) for related comments on efficiency
for two-way stratification.


2.3. Higher-way Stratification 


The proposed approach extends naturally to 3 or more
stratifying factors by letting $s$ denote the corresponding
$r$-way array. The loss function will typically include further
terms, for example for three-way stratification we might
take
$$w(s) = \lambda_1 \sum_{i=1}^{R_1} (n_{i\cdot\cdot}(s) - nP_{i\cdot\cdot})^2 + \lambda_2 \sum_{j=1}^{R_2} (n_{\cdot j\cdot}(s) - nP_{\cdot j\cdot})^2 + \lambda_3 \sum_{k=1}^{R_3} (n_{\cdot\cdot k}(s) - nP_{\cdot\cdot k})^2$$
in obvious notation, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are included to
represent the relative importance of balancing on the three
factors and might consist of prior estimates of the
variances of the Y means between categories of the three
stratifying factors, as in (2.10).


2.4 Multistage Sampling 


One important practical application of multi-way 
stratification is to the selection of primary sampling units 
(psu’s) in multistage sampling, where it is common for 
information of several stratifying factors to be available. 

In the approach of Section 2.1, the inclusion probabilities
of each population unit are $E(n_{ij}(s)/N_{ij}) = n/N$.
If it is desired to select psu's with equal probability then
this approach extends immediately with the psu's constituting
the units and with the observed values of Y
replaced by unbiased estimators of the psu totals. Suppose
instead that it is desired to select psu's with unequal probabilities,
say $n z_{ijk}$, for psu $k$ in cell $ij$, where usually $z_{ijk}$
will equal $M_{ijk}/\sum_{ijk} M_{ijk}$, with $M_{ijk}$ being some measure
of size of psu $k$ in cell $ij$. Then the procedure may be simply
modified by setting $P_{ij}$ equal to the sum of $z_{ijk}$ over psu's
$k$ in cell $ij$. Then, if $n_{ij}(s) > 0$, a sample of $n_{ij}(s)$ psu's
in cell $ij$ is selected by some probability proportional to
$z_{ijk}$ method.


3. EXAMPLES 


Example 1: Bryant, Hartley and Jessen (1960) 


We will first demonstrate the method on the hypothetical
example of Bryant et al. (1960) given in Table 1. We
first reduce the problem to the form of (2.4), (2.5) and
(2.6), where the $r_{ij}$'s are given in Table 2. The weight
function in (2.9) in this reduced linear programming
problem becomes
$$w(s) = \sum_{i=1}^{5} (\tilde n_{i\cdot}(s) - r_{i\cdot})^2 + \sum_{j=1}^{3} (\tilde n_{\cdot j}(s) - r_{\cdot j})^2.$$
Applying a standard linear programming package in the
NAG FORTRAN library, we obtain the solution given in
Table 3. The $I_{ij}$ values have been added to the solution so
that $n_{ij}(s) = I_{ij} + \tilde n_{ij}(s)$. It turns out for this solution that
each $s$, for which $p(s) > 0$, has margins $n_{i\cdot}(s)$ and $n_{\cdot j}(s)$
which match the desired margins exactly, that is the
solution makes (2.4) zero.


Table 3 


Solution to Example 1


Ky p(s) Ss D(s) 
i ot 3@ atk 
WU) Om Onl 
(Oe bah 0.2 Oe il ail 0.1 
Oy 2 il ib al al 
it A Al ie th 
ene) Oy 
OMOmrn Oak Gy 
OMION 2 0.2 1 (Om AL 0.2 
ee en eg) Ol 2 wt 
ete a Decl aaa) 
0.1 0.2 


ee OOF 
Hee RO 
oOrror 
== OO 
HNeEoO 
CORR 


Example 2: Jessen (1970) 


Jessen (1970) proposed two methods for two-way and 
three-way stratification. Both of these are quite compli- 
cated and involve determining the set of samples which 
exactly match the margins. Neither method is guaranteed 
to yield a solution. Jessen (1970) applies both methods to 
a simple hypothetical example for which both yield a 
solution. This example is reproduced in Table 4. In this 
example, since all of the $nP_{ij} < 1$, the linear programming
problems defined by (2.1), (2.2) and (2.3) and by
(2.4), (2.5) and (2.6), respectively, are identical. We
applied our method to this problem, again using the $w(s)$
as defined in (2.9). By trying a number of different seeds
in the optimization routine, we were able to obtain three
different solutions, all of which make (2.2) zero and satisfy
the constraints. These are given in Table 5. The first two
solutions are the same two as obtained by Jessen's method 2
and method 3, respectively.


Table 4

Example 2: Jessen (1970)

Expected Sample Cell Counts Under Proportionate
Stratification with n = 6

               Columns
Rows       1      2      3      nP_i.
  1       0.8    0.5    0.7      2.0
  2       0.7    0.8    0.5      2.0
  3       0.5    0.7    0.8      2.0
nP_.j     2.0    2.0    2.0      6.0
Table 5
Solution to Example 2

    s        p1(s)   p2(s)   p3(s)

  1 0 1
  1 1 0       0.5     0.4     0.3
  0 1 1

  1 1 0
  0 1 1       0.3     0.2     0.1
  1 0 1

  0 1 1
  1 0 1       0.2     0.1     0.0
  1 1 0

  1 1 0
  1 0 1       0.0     0.1     0.2
  0 1 1

  1 0 1
  0 1 1       0.0     0.1     0.2
  1 1 0

  0 1 1
  1 1 0       0.0     0.1     0.2
  1 0 1
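
Whether a candidate design satisfies the proportionate stratification constraint (2.1) and attains zero expected loss under (2.9) can be verified directly. The Python sketch below does this for Example 2; the function names are illustrative assumptions, and the six arrays and three probability vectors are as read from Table 5 (each array has exactly one empty cell in every row and column).

```python
from fractions import Fraction as F

# Expected counts n * P_ij from Table 4 (n = 6).
NP = [[F(8, 10), F(5, 10), F(7, 10)],
      [F(7, 10), F(8, 10), F(5, 10)],
      [F(5, 10), F(7, 10), F(8, 10)]]

# Candidate samples s (cell sample sizes), as read from Table 5.
SAMPLES = [
    ((1, 0, 1), (1, 1, 0), (0, 1, 1)),
    ((1, 1, 0), (0, 1, 1), (1, 0, 1)),
    ((0, 1, 1), (1, 0, 1), (1, 1, 0)),
    ((1, 1, 0), (1, 0, 1), (0, 1, 1)),
    ((1, 0, 1), (0, 1, 1), (1, 1, 0)),
    ((0, 1, 1), (1, 1, 0), (1, 0, 1)),
]

DESIGNS = {
    "p1": [F(5, 10), F(3, 10), F(2, 10), F(0), F(0), F(0)],
    "p2": [F(4, 10), F(2, 10), F(1, 10), F(1, 10), F(1, 10), F(1, 10)],
    "p3": [F(3, 10), F(1, 10), F(0), F(2, 10), F(2, 10), F(2, 10)],
}


def loss(s, nP):
    """w(s) of (2.9): squared deviations of row and column totals from their targets."""
    R, C = len(nP), len(nP[0])
    rows = sum((sum(s[i]) - sum(nP[i])) ** 2 for i in range(R))
    cols = sum((sum(s[i][j] for i in range(R)) - sum(nP[i][j] for i in range(R))) ** 2
               for j in range(C))
    return rows + cols


def expected_counts(samples, probs):
    """Left-hand side of (2.1): sum over s of n_ij(s) p(s), for every cell."""
    R, C = len(samples[0]), len(samples[0][0])
    return [[sum(p * s[i][j] for s, p in zip(samples, probs)) for j in range(C)]
            for i in range(R)]


def is_proportionate(samples, probs, nP):
    """True if the probabilities sum to one and constraint (2.1) holds exactly."""
    return sum(probs) == 1 and expected_counts(samples, probs) == nP


def expected_loss(samples, probs, nP):
    """Objective (2.2): sum over s of w(s) p(s)."""
    return sum(p * loss(s, nP) for s, p in zip(samples, probs))


if __name__ == "__main__":
    for name, probs in DESIGNS.items():
        print(name, is_proportionate(SAMPLES, probs, NP), expected_loss(SAMPLES, probs, NP))
```

Because every listed array has two units in each row and each column, its loss is zero, so any probability vector on these arrays that satisfies (2.1) is optimal; this is why several distinct solutions exist.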


Example 3: Causey, Cox and Ernst (1985) 


Causey et al. (1985) give an example of three-way
stratification for which their method fails to yield a solution.
They consider a population subject to a 2 x 2 x 2
stratification from which a sample of size n = 2 is to be
drawn, with the expected sample size in the $ijk$-th cell,
$n_{ijk}$, as follows:
$$n_{111} = n_{221} = n_{122} = n_{212} = .5,$$
$$n_{211} = n_{121} = n_{112} = n_{222} = 0.$$
If we apply our method in a similar manner to
Examples 1 and 2 we obtain the solution given in Table 6.
In this case, the objective function did not attain zero so
that the margins are not exactly matched in each sample.


Table 6 


Solution to Example 3 


RY 


p(s) 
al ip 9) 
ib e@ Oe a 0.5 
0 0 0 0 
0 0 0 0 0.5 
Orel 170 


4. COMPARISON OF MSE 


In this section the mean squared error (MSE) of the
proposed design with estimator $\bar y$ will be compared with
the MSE of the design of Bryant et al. (1960) with either
of the two estimators they propose, namely $\bar y_U$ and $\bar y_B$,
where the U and B subscripts indicate that the first
estimator is unbiased and the second is not. Let the cells
be denoted $c$ ($ij$ in the two-way case), let $k$ (and where
necessary $k'$) denote a unit within a cell, and suppress the
$s$ in $n_c(s)$ for simplicity of notation. The inclusion probability
of any unit $k$ in cell $c$ is


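$$\pi_{ck} = E[n_c]/N_c = E[n_c]/(N P_c) \tag{4.1}$$
and the joint inclusion probability of unit $k$ in cell $c$ and
unit $k'$ in cell $c'$ is
$$\pi_{ckc'k'} = \begin{cases} \dfrac{E[n_c(n_c - 1)]}{N_c(N_c - 1)} & \text{if } c = c', \\[2ex] \dfrac{E[n_c n_{c'}]}{N_c N_{c'}} & \text{if } c \ne c'. \end{cases} \tag{4.2}$$
For large $N$ this is approximately
$$\pi_{ckc'k'} \approx \frac{E(n_c n_{c'})}{N^2 P_c P_{c'}} - I_{cc'} \frac{E(n_c)}{(N P_c)^2}, \tag{4.3}$$
where
$$I_{cc'} = \begin{cases} 1 & \text{if } c = c', \\ 0 & \text{if } c \ne c'. \end{cases}$$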


The expectations will differ for our design compared to
the Bryant et al. design and thus the $\pi_{ck}$ and $\pi_{ckc'k'}$ will
differ. Keeping this in mind we can obtain the variance
of $\bar y$, $\bar y_U$ and $\bar y_B$ in a generalized form in terms of the $\pi_{ck}$
and $\pi_{ckc'k'}$ values and thus have some basis for comparison.
To do this, let us consider an estimator of the form
$z = \sum_c \sum_k w_c y_{ck}/n$, where the sum over $k$ runs over the
sampled units in cell $c$ and the $w_c$ values are fixed known
constants independent of $k$. If we restrict to the case where
$n_{i\cdot} = nP_{i\cdot}$ and $n_{\cdot j} = nP_{\cdot j}$, that is, integer marginal
requirements, then both of the estimators given in Bryant
et al. as well as our estimator are of this form. We will
assume this to be the case in the sequel. Replacing the
subscript $c$ with $ij$ for two-way stratification, $\bar y_U$ and $\bar y_B$ are
of the same form as $z$ with $w_c = w_{ij} = G_{ij} = P_{ij}/(P_{i\cdot} P_{\cdot j})$
and $w_c = w_{ij} = 1$, respectively. The estimator $\bar y$ is also of
the form $z$ with $w_c = w_{ij} = 1$.

We can now obtain a general form for the variance of
z keeping in mind that the $\pi_{ck}$ and $\pi_{ck,c'k'}$ values will differ
for the Bryant et al. design and our design:


$$V(z) = \frac{1}{2n^2} \sum_c \sum_k \sum_{c'} \sum_{k'} (\pi_{ck}\pi_{c'k'} - \pi_{ck,c'k'}) (w_c y_{ck} - w_{c'} y_{c'k'})^2. \qquad (4.4)$$


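Using (4.1) and (4.3) this becomes


$$V(z) = \frac{1}{2n^2} \sum_c \frac{w_c^2 E(n_c)}{N^2 P_c^2} \sum_k \sum_{k'} (y_{ck} - y_{ck'})^2 - \frac{1}{2n^2} \sum_c \sum_{c'} \frac{\mathrm{Cov}(n_c, n_{c'})}{N^2 P_c P_{c'}} \sum_k \sum_{k'} (w_c y_{ck} - w_{c'} y_{c'k'})^2. \qquad (4.5)$$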


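Noting that
$$\sum_k \sum_{k'} (y_{ck} - y_{ck'})^2 = 2N^2 P_c^2 S_c^2$$
and
$$\sum_k \sum_{k'} (w_c y_{ck} - w_{c'} y_{c'k'})^2 = N^2 P_c P_{c'} [w_c^2 S_c^2 + w_{c'}^2 S_{c'}^2 + (w_c \bar{Y}_c - w_{c'} \bar{Y}_{c'})^2],$$


where $S_c^2$ refers to the population variance of cell c, (4.5)
reduces to


$$V(z) = \frac{1}{n^2} \sum_c w_c^2 E(n_c) S_c^2 - \frac{1}{2n^2} \sum_c \sum_{c'} \mathrm{Cov}(n_c, n_{c'}) [w_c^2 S_c^2 + w_{c'}^2 S_{c'}^2 + (w_c \bar{Y}_c - w_{c'} \bar{Y}_{c'})^2]$$


$$= v_1 + v_2, \text{ say.} \qquad (4.6)$$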


The first term $v_1$ may be interpreted as the usual stratified
variance for fixed sample sizes $E(n_c)$ within the two-way
'strata' (of course in our case the $E(n_c)$ will generally not
be integers). The second term $v_2$ may be interpreted as the
increase in variance arising from the variability of the $n_c$
and the correlation between them. We discuss this further
at the end of this section. We now revert to the notation
c = ij and compare the variances for two-way stratification.
First let us consider $v_1$ in (4.6). For the Bryant et al.
method $E(n_{ij}) = nP_{i\cdot}P_{\cdot j}$, $\bar{y}_U = \sum_i \sum_j \sum_k G_{ij} y_{ijk}/n$,
$G_{ij} = P_{ij}/(P_{i\cdot}P_{\cdot j})$ and $\bar{y}_B = \sum_i \sum_j \sum_k y_{ijk}/n$.
Thus 


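$$v_1(\bar{y}_U) = \sum_i \sum_j P_{i\cdot} P_{\cdot j} G_{ij}^2 S_{ij}^2 / n,$$


(this is the same as the first term of equation (12) in Bryant
et al.) and


$$v_1(\bar{y}_B) = \sum_i \sum_j P_{i\cdot} P_{\cdot j} S_{ij}^2 / n.$$


In the case of our approach $E(n_{ij}) = nP_{ij}$ and
$\bar{y} = \sum_i \sum_j \sum_k y_{ijk}/n$ so that


$$v_1(\bar{y}) = \sum_i \sum_j P_{ij} S_{ij}^2 / n.$$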
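
The three $v_1$ expressions above can be compared numerically. The following is a minimal sketch, assuming a hypothetical 2 x 2 table of proportions and unit within-cell variances; the function name `v1_terms` and the input values are illustrative and do not come from the paper.

```python
def v1_terms(P, S2, n):
    """Return (v1 for y_U, v1 for y_B, v1 for the proposed y) in a two-way table.

    P  : nested list of cell proportions P_ij (summing to 1)
    S2 : nested list of within-cell variances S_ij^2
    n  : total sample size
    """
    rows, cols = len(P), len(P[0])
    row_tot = [sum(P[i]) for i in range(rows)]                           # P_i.
    col_tot = [sum(P[i][j] for i in range(rows)) for j in range(cols)]   # P_.j
    v1_U = v1_B = v1_prop = 0.0
    for i in range(rows):
        for j in range(cols):
            indep = row_tot[i] * col_tot[j]    # E(n_ij)/n under Bryant et al.
            G = P[i][j] / indep                # weight G_ij of the unbiased estimator
            v1_U += indep * G ** 2 * S2[i][j] / n
            v1_B += indep * S2[i][j] / n
            v1_prop += P[i][j] * S2[i][j] / n  # E(n_ij)/n = P_ij under the proposed design
    return v1_U, v1_B, v1_prop


# Hypothetical inputs, chosen only for illustration.
P = [[0.3, 0.1], [0.1, 0.5]]
S2 = [[1.0, 1.0], [1.0, 1.0]]
v1_U, v1_B, v1_prop = v1_terms(P, S2, n=10)
print(v1_U, v1_B, v1_prop)
```

With a common within-cell variance, the second and third terms both equal $S^2/n$, while the first is at least as large because $\sum_i\sum_j P_{ij}^2/(P_{i\cdot}P_{\cdot j}) \geq 1$.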


Next let us consider $v_2$. It is not difficult to show that
for both the Bryant et al. method and our approach
(see Appendix)


$$\sum_i \mathrm{Cov}(n_{ij}, n_{i'j'}) = \sum_j \mathrm{Cov}(n_{ij}, n_{i'j'}) = 0. \qquad (4.7)$$


Using this and replacing c and c' by ij and i'j', respectively,
in $v_2$ given in (4.6), it follows that $v_2$ reduces to


$$\frac{1}{n^2} \sum_i \sum_j \sum_{i'} \sum_{j'} \mathrm{Cov}(n_{ij}, n_{i'j'})\, w_{ij} w_{i'j'} \bar{Y}_{ij} \bar{Y}_{i'j'}.$$


Replacing $w_{ij}$ with $G_{ij}$ we get $v_2(\bar{y}_U)$, and using simple
algebra one can show that this is the same as term 2 of equation
(12) in Bryant et al. Replacing $w_{ij}$ with 1 gives the form
of $v_2(\bar{y}_B)$ and of $v_2(\bar{y})$, noting that the Cov($n_{ij}$, $n_{i'j'}$) will
not be the same for both. So we see that $v_2$ depends only
on the cell means while $v_1$ depends only on the within cell
variances.


Finally, we should note that


$$\mathrm{bias}(\bar{y}_B) = \sum_i \sum_j (P_{i\cdot} P_{\cdot j} - P_{ij}) \bar{Y}_{ij} \qquad (4.8)$$


since to compare the three estimators the mean square error
(MSE) will be the relevant measure, and this bias will
contribute to MSE($\bar{y}_B$).
Combining the expressions for $v_1$, $v_2$ and bias($\bar{y}_B$)
above permits an analytical comparison of the MSE of the
proposed approach with that of the approach of Bryant
et al. (1960) using either $\bar{y}_U$ or $\bar{y}_B$. It is difficult, however,
to make general statements about the relative performance
of the different strategies and so we now consider introducing
some model assumptions in order to approximate
the different components of the MSE expressions, in some
specific settings. We first consider the additive model:


$$y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk},$$


where $y_{ijk}$ is the k-th observation in the ij-th cell, $\alpha_i$ and $\beta_j$
are fixed effects and $\epsilon_{ijk}$ are independent errors with zero
mean and common variance $\sigma^2$. Then $E_m(S_{ij}^2) = \sigma^2$ and
$E_m(\bar{Y}_{ij}\bar{Y}_{i'j'}) = (\mu + \alpha_i + \beta_j)(\mu + \alpha_{i'} + \beta_{j'})$. Thus the
model-expected design-variance is given by replacing $S_{ij}^2$
by $\sigma^2$ and $\bar{Y}_{ij}$ by $\mu + \alpha_i + \beta_j$ in the formulas for $v_1$ and
$v_2$ for the various estimators. In this case, $v_2(\bar{y}_B) = 0$.
This point was realized by Bryant et al. when comparing
$\bar{y}_U$ and $\bar{y}_B$. The bias term will be zero in this case unless
there was rounding on the margins, that is bias($\bar{y}_B$) = 0
provided $n_{i\cdot} = nP_{i\cdot}$ and $n_{\cdot j} = nP_{\cdot j}$ as is the case in their
example. This easily follows from (4.8) and


$$\sum_i (P_{i\cdot} P_{\cdot j} - P_{ij}) = \sum_j (P_{i\cdot} P_{\cdot j} - P_{ij}) = 0.$$


This was also shown by Bryant et al. p. 119 equation (47).
Using (4.7), it is easily shown that $v_2(\bar{y}) = 0$ as well.
This combined with the unbiasedness of $\bar{y}$ and the fact that
$v_1(\bar{y}_B) = v_1(\bar{y}) = \sigma^2/n$ in this case implies that for this
situation MSE($\bar{y}_B$) = MSE($\bar{y}$), that is the proposed procedure
has the same MSE as the procedure of Bryant et al.
using the biased estimator. We demonstrate in the sequel
that even when this additive model is applicable ($\gamma = 0$
below), $v_2(\bar{y}_U)$ may be large while $v_1(\bar{y}_U) > v_1(\bar{y})$.
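
The claim that the bias (4.8) vanishes under the additive model can be checked directly. This is a minimal sketch, assuming hypothetical proportions and effects; `bias_yB` and `cell_means` are illustrative names, and the interaction coefficient `gamma` is included only to show that the bias is generally nonzero once the cell means are non-additive.

```python
def bias_yB(P, Ybar):
    """Bias (4.8): sum over cells of (P_i. P_.j - P_ij) * Ybar_ij."""
    rows, cols = len(P), len(P[0])
    row_tot = [sum(P[i]) for i in range(rows)]
    col_tot = [sum(P[i][j] for i in range(rows)) for j in range(cols)]
    return sum((row_tot[i] * col_tot[j] - P[i][j]) * Ybar[i][j]
               for i in range(rows) for j in range(cols))


def cell_means(mu, alpha, beta, gamma):
    """Cell means mu + alpha_i + beta_j + gamma * alpha_i * beta_j."""
    return [[mu + a + b + gamma * a * b for b in beta] for a in alpha]


# Hypothetical values, chosen only for illustration.
P = [[0.3, 0.1], [0.1, 0.5]]
alpha = [-1.0, 1.0]
beta = [-0.5, 0.5]

bias_additive = bias_yB(P, cell_means(1.0, alpha, beta, gamma=0.0))
bias_interaction = bias_yB(P, cell_means(1.0, alpha, beta, gamma=1.0))
print(bias_additive, bias_interaction)
```

The additive part contributes nothing because the row and column sums of $P_{i\cdot}P_{\cdot j} - P_{ij}$ are zero; only the interaction term survives.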
To compare the estimators further, let us consider the
situation of Example 1. The above derivations allow us
to obtain the MSE's of the three estimators for this
example provided we have the $S_{ij}^2$'s, the $\bar{Y}_{ij}$'s and can
calculate the Cov($n_{ij}$, $n_{i'j'}$) for the Bryant et al. method
as well as for our approach. The covariances for the
Bryant et al. method are given in their paper in terms of
the $P_{ij}$'s, while the covariances for our approach can be
obtained from the solution in Table 3. We will consider
non-additive departures from the above model, namely


$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma \alpha_i \beta_j + \epsilon_{ijk},$$


for various values of $\gamma$. For simplicity of presentation, let
$\mu = 1$, $\alpha_i = i - 3$, $\beta_j = j - 2$ (note in fact that the
MSE of each strategy is invariant to the choice of $\mu$). Thus
the model-expected design-variance is given by replacing $S_{ij}^2$
by 1 and $\bar{Y}_{ij}$ by $1 + (i - 3) + (j - 2) + \gamma (i - 3)(j - 2)$
in the formulas for $v_1$ and $v_2$ for the various estimators.
Table 7 gives the resulting $v_1$, $v_2$, and MSE values for the
three estimators (as well as the bias squared term for $\bar{y}_B$),
for various values of $\gamma$. From Table 7, it can be seen that
for an additive model, $\gamma = 0$, $\bar{y}_B$ and $\bar{y}$ perform equally
well, while $\bar{y}_U$ is inferior. As the model becomes more
non-additive, and $|\gamma|$ increases, the two estimators for
the Bryant et al. strategy tend to perform similarly, both
with MSE becoming increasingly greater than that of the
proposed strategy. This pattern is primarily due to the $v_2$
component of the MSE of the three estimators. The bias
term of $\bar{y}_B$ is of lesser importance, although it may be
more important for larger n.

The greater increase in $v_2$ as $|\gamma|$ increases for the
Bryant et al. design appears to reflect the greater
variability of each $n_{ij}$ for this design. It should be noted
that it would have been possible to reduce this variability
somewhat by applying a variant of the Bryant et al.
method instead to Table 2, as was done for the proposed
method, though one would need to derive adjusted $G_{ij}$
weights for $\bar{y}_U$ and it would be difficult to handle the 0.0
cell entries in Table 2. However, even if this were accom- 
plished, the 7; for this design may still take values other 
than just 0 and 1; for example 747 could take values 0, 1, 
or 2. This inflated n, variability is inherent in the Bryant 
et al. method. For example, suppose nj. ase 
Then using the Bryant et al. method, $n_{11}$ can take values
0, 1, 2, 3, 4, or 5, while with the proposed method it can
take only values $[nP_{11}]$ or $[nP_{11}] + 1$. If $nP_{11} < 1$, the
technique used to go from Table 1 to Table 2 will not
improve matters.


Table 7
Comparison of MSE for Three Estimators

              Bryant, Hartley, Jessen Design                Proposed Design
         ------------------------------------------      ---------------------
               y_U (unbiased)         y_B (biased)                  y
 gamma    v_1     v_2     MSE      v_1    Bias^2   MSE      v_1     v_2     MSE
  0      .125    .105    .230     .100    .000    .100     .100    .000    .100
  .25    .125    .063    .188     .100    .002    .135     .100    .018    .118
  .5     .125    .105    .230     .100    .008    .239     .100    .071    .171
 1.0     .125    .440    .565     .100    .032    .655     .100    .284    .384
 1.5     .125   1.111   1.236     .100    .073   1.349     .100    .638    .738
5. VARIANCE ESTIMATION


In this section, we will consider variance estimation for
our proposed method. Using (4.1) and recalling constraint
(2.1), it is clear that


$$\pi_{ck} = E[n_c(s)]/N_c = n/N.$$


The joint inclusion probability of two units k, k' in the
same cell c is


$$\pi_{ck,ck'} = E[n_c(s)\{n_c(s) - 1\}]/\{N_c(N_c - 1)\}.$$


Suppose $n_c(s) = I_c + \tilde{n}_c(s)$ where $I_c$ is the fixed integer
$[nP_c]$ and $\tilde{n}_c(s)$ is 0 or 1.


If $nP_c \leq 1$ then $I_c = 0$ and $\pi_{ck,ck'} = 0$. Hence a
necessary condition for unbiased variance estimation to
be possible is that $nP_c > 1$ for all cells c. On the other
hand if this condition holds then $n_c(s) \geq 1$ for all c and
hence the probability of inclusion of any pair of units in
different cells is also always positive. Hence this condition
is necessary and sufficient for unbiased variance estimation
to be possible.


When this condition holds we obtain
$$\pi_{ck,ck'} = I_c(I_c + 2r_c - 1)/[N_c(N_c - 1)] = A_c,$$


say, where $r_c = nP_c - I_c$.


The joint inclusion probabilities for pairs of units in
different cells c and c' are


$$\pi_{ck,c'k'} = E[n_c(s) n_{c'}(s)]/(N_c N_{c'})$$


$$= [I_c I_{c'} + I_c r_{c'} + I_{c'} r_c + r_{cc'}]/(N_c N_{c'}) = B_{cc'}, \qquad (5.1)$$
say, where $r_{cc'} = E[\tilde{n}_c(s)\tilde{n}_{c'}(s)]$.


Hence an unbiased estimator of $V(\bar{y}(s))$ of Sen-Yates-
Grundy form may be constructed in the usual way.
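
As an illustration of how the joint inclusion probabilities follow from a design, the sketch below enumerates a small hypothetical design p(s) over two cells and evaluates the within-cell and between-cell probabilities from their defining expectations. The design, the cell sizes, and the function names are assumptions made for the example; the closed form for the within-cell case is valid only when each $n_c(s)$ takes the values $[nP_c]$ or $[nP_c] + 1$.

```python
import math


def pair_probabilities(design, N_cells):
    """Joint inclusion probabilities from an enumerated design.

    design  : list of (p(s), [n_c(s) for each cell c])
    N_cells : population cell sizes N_c
    Returns (A, B): A[c] for two units in cell c, B[(c, d)] for units in cells c != d.
    """
    C = len(N_cells)
    A = [sum(p * n[c] * (n[c] - 1) for p, n in design) / (N_cells[c] * (N_cells[c] - 1))
         for c in range(C)]
    B = {(c, d): sum(p * n[c] * n[d] for p, n in design) / (N_cells[c] * N_cells[d])
         for c in range(C) for d in range(C) if c != d}
    return A, B


def A_closed_form(design, N_cells, c):
    """Within-cell probability via I_c (I_c + 2 r_c - 1) / [N_c (N_c - 1)]."""
    expected = sum(p * n[c] for p, n in design)   # equals n * P_c
    I = math.floor(expected)
    r = expected - I
    return I * (I + 2 * r - 1) / (N_cells[c] * (N_cells[c] - 1))


# Hypothetical design: n = 4, N_c = (4, 6), so n * P_c = (1.6, 2.4).
design = [(0.6, [2, 2]), (0.4, [1, 3])]
N_cells = [4, 6]
A, B = pair_probabilities(design, N_cells)
print(A, B[(0, 1)], A_closed_form(design, N_cells, 0))
```

All of these probabilities are positive here because $nP_c > 1$ in both cells.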

In practice, however, we wish to consider situations
where $nP_c < 1$ for some c. In this case one assumption
we might make following Bryant et al. (1960, Sect. 7) in
order to derive a variance estimator is that the population
variance of Y is constant within each cell c, say $S^2$.


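Let us first obtain the variance of $\bar{y}(s)$ in the general
case


$$V(\bar{y}(s)) = \frac{1}{2n^2} \sum_c \sum_{k \neq k'} \Big(\frac{n^2}{N^2} - A_c\Big)(y_{ck} - y_{ck'})^2 + \frac{1}{2n^2} \sum_{c \neq c'} \sum_k \sum_{k'} \Big(\frac{n^2}{N^2} - B_{cc'}\Big)(y_{ck} - y_{c'k'})^2.$$


Now providing $B_{cc'} > 0$ for all c, c' we may estimate the
second term unbiasedly by


$$\frac{1}{2n^2} \sum_{c \neq c'} \sum_{k=1}^{n_c(s)} \sum_{k'=1}^{n_{c'}(s)} \frac{n^2/N^2 - B_{cc'}}{B_{cc'}} (y_{ck} - y_{c'k'})^2,$$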


Where AM ="| 07S (S) a Le aS eee tne en a 


The first term can be written as


$$\frac{1}{2n^2} \sum_c \Big(\frac{n^2}{N^2} - A_c\Big) 2N_c^2 S^2.$$


For any c s.t. $n_c(s) \geq 2$


$$E\Big[\sum_{k \neq k'} \frac{(y_{ck} - y_{ck'})^2}{2n_c(s)\{n_c(s) - 1\}} \,\Big|\, n_c(s)\Big] = S^2.$$


Thus provided at least one $n_c(s)$ is $\geq 2$ an unbiased
estimator of the first term is


$$\frac{1}{2n^2 D} \sum_c \Big(\frac{n^2}{N^2} - A_c\Big) 2N_c^2 \sum_{\{c:\, n_c(s) \geq 2\}} \sum_{k \neq k'} \frac{(y_{ck} - y_{ck'})^2}{2n_c(s)\{n_c(s) - 1\}},$$


where D = the number of cells, c, such that $n_c(s) \geq 2$.


The above requires $B_{cc'} > 0$. If
$I_c = I_{c'} = 0$,


by (5.1), we need
$$r_{cc'} = \sum_s \tilde{n}_c(s)\tilde{n}_{c'}(s)p(s) > 0, \qquad (5.2)$$


which is linear in p(s). The constraint (5.2) can be handled
in linear programming if desired. There will be such a
constraint for each pair c, c' s.t. $I_c = I_{c'} = 0$.


6. CONCLUDING REMARKS 


We have proposed a linear programming approach to 
multi-way stratification, applying ideas of Rao and Nigam 
(1990, 1992). The approach is simple in conception and 
is very flexible in allowing for a range of different objec- 
tives via the loss function w(s), as well as in permitting 
a variety of constraints such as that the joint inclusion
probabilities of all cells be positive. The main practical 
constraint on the procedure is that it may rapidly become 
computationally expensive if not impossible as the number 
of cells in the multi-way classification increases. Some 
ideas on how to reduce the amount of computation have 
been considered. Further research on this question would 
be useful. For cases where the computational demands are 
prohibitive, the method of Causey et al. (1985) remains
an alternative. 
APPENDIX


Proof of (4.7) for Proposed Method 


Note that
$$\mathrm{Cov}(n_{ij}(s), n_{i'j'}(s)) = E(n_{ij}(s)n_{i'j'}(s)) - E(n_{ij}(s))E(n_{i'j'}(s)).$$


Equation (2.1) states that $E(n_{ij}(s)) = nP_{ij}$. By definition
$E(n_{ij}(s)n_{i'j'}(s)) = \sum_s n_{ij}(s)n_{i'j'}(s)p(s)$.
Thus


$$\sum_j E(n_{ij}(s))E(n_{i'j'}(s)) = n^2 P_{i'j'} \sum_j P_{ij} = n^2 P_{i\cdot} P_{i'j'} \qquad (7.1)$$


and


$$\sum_j E(n_{ij}(s)n_{i'j'}(s)) = \sum_j \sum_s n_{ij}(s)n_{i'j'}(s)p(s) = \sum_s p(s)n_{i'j'}(s) \sum_j n_{ij}(s). \qquad (7.2)$$
Assume that the solution to the linear optimization
problem (2.2) equals zero, where w(s) is given in (2.9). In
this case, $\sum_j n_{ij}(s) = n_{i\cdot}(s) = nP_{i\cdot}$ and (7.2) implies


$$\sum_j E(n_{ij}(s)n_{i'j'}(s)) = \sum_s p(s)n_{i'j'}(s)\,nP_{i\cdot} = nP_{i\cdot} \sum_s n_{i'j'}(s)p(s) = nP_{i\cdot}E(n_{i'j'}(s)) = nP_{i\cdot}\,nP_{i'j'}. \qquad (7.3)$$


Equations (7.1) and (7.3) together imply $\sum_j \mathrm{Cov}(n_{ij}(s), n_{i'j'}(s)) = 0$.
It can be similarly shown that


$$\sum_i \mathrm{Cov}(n_{ij}(s), n_{i'j'}(s)) = \sum_{i'} \mathrm{Cov}(n_{ij}(s), n_{i'j'}(s)) = \sum_{j'} \mathrm{Cov}(n_{ij}(s), n_{i'j'}(s)) = 0.$$
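
The argument above can be verified on a small enumerated design. The sketch below is a minimal check, assuming a hypothetical 2 x 2 design with n = 2 in which every sample has row and column totals equal to 1; `cov_sum_over_j` is an illustrative name.

```python
def cov_sum_over_j(design, i, ip, jp):
    """Sum over j of Cov(n_ij, n_i'j') for a design given as [(p(s), table n(s))]."""
    def expect(f):
        return sum(p * f(table) for p, table in design)

    cols = len(design[0][1][0])
    total = 0.0
    for j in range(cols):
        joint = expect(lambda t: t[i][j] * t[ip][jp])
        product = expect(lambda t: t[i][j]) * expect(lambda t: t[ip][jp])
        total += joint - product
    return total


# Hypothetical design: both samples match the margins exactly.
design = [
    (0.5, [[1, 0], [0, 1]]),
    (0.5, [[0, 1], [1, 0]]),
]
sums = [cov_sum_over_j(design, i, ip, jp)
        for i in range(2) for ip in range(2) for jp in range(2)]
print(sums)
```

Every sum is zero because the row total $n_{i\cdot}(s)$ is constant over samples, which is exactly the step used in (7.3).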
Regression Weighting in the Presence of Nonresponse
with Application to the 1987-1988 Nationwide 
Food Consumption Survey 


WAYNE A. FULLER, MARIE M. LOUGHIN and HAROLD D. BAKER¹


ABSTRACT 


A regression weight generation procedure is applied to the 1987-1988 Nationwide Food Consumption Survey of 
the U.S. Department of Agriculture. Regression estimation was used because of the large nonresponse in the 
survey. The regression weights are generalized least squares weights modified so that all weights are positive and 
so that large weights are smaller than the least squares weights. It is demonstrated that the regression estimator 
has the potential for large reductions in mean square error relative to the simple direct estimator in the presence 


of nonresponse. 


KEY WORDS: Non-negative weights; Consistency. 


1. INTRODUCTION 


In many sampling situations, the population means of 
auxiliary variables are known, but the particular values of 
the variables for individual elements are not used in the 
sample selection. Although the information is not used in 
the sampling design, it may be highly desirable to incor- 
porate the information about population means into the 
estimation procedure. Common estimation procedures 
utilizing auxiliary information are ratio estimation, post- 
stratification, regression estimation, and raking. Regression 
estimation is the most general procedure in that the regression 
method can handle multiple auxiliary variables, continuous 
auxiliary variables, and discrete auxiliary variables. Post- 
stratification can be considered a special case of regression 
estimation in which the regression variables are indicator 
variables for the post strata. The raking procedure, also 
known as iterative proportional fitting, is restricted to 
auxiliary information in the form of discrete categories. 
Deming and Stephan (1940), Stephan (1942), El-Badry and 
Stephan (1955), Ireland and Kullback (1968), Darroch and
Ratcliff (1972), Brackstone and Rao (1979), and Oh and 
Scheuren (1987) are references on raking. 

Early applications of regression estimation are Watson 
(1937), Cochran (1942) and Jessen (1942). Cochran (1977, 
Ch. 7) contains the basic theory. Regression estimation 
for survey samples has been discussed by numerous 
authors, including Mickey (1959), Fuller (1975), Royall 
and Cumberland (1981), Isaki and Fuller (1982), Wright 
(1983), Luery (1986), Alexander (1987), Bethlehem and 
Keller (1987), Copeland, Pritzmeier, and Hoy (1987), 
Lemaitre and Dufour (1987), Särndal, Swensson and
Wretman (1989), Deville and Särndal (1992), Zieschang
(1990), and Rao (1992). 


In much of the cited literature, regression estimation 
is described as a procedure for reducing variance in prob- 
ability samples. In practice, one of the motivations for 
regression estimation is the potential for reducing bias 
associated with selective nonresponse. See, for example, 
Little and Rubin (1987, p. 55) for the special case of 
adjustment cells, and Bethlehem (1988) for the generalized 
regression estimator. 

Nonresponse prompted the use of regression estimation 
in our application and we discuss regression estimation in 
the response adjustment context in Section 3. The standard 
regression estimator and the modified procedure that 
produces positive weights are introduced in Section 2. 
Application of the regression weighting procedure to the 
Nationwide Food Consumption Survey is described in 
Section 4. 


2. REGRESSION ESTIMATOR 


To introduce the regression estimator used in our study,
assume that a sample containing n units has been selected
and that the probability of selecting unit i is $\pi_i$. For our
purposes, it is sufficient for $\pi_i$ to be proportional to the
selection probabilities. The sample might be a two-stage
stratified sample, and the unit can be either the primary
sampling unit or the observation unit. In our application,
the unit is the observation unit. Assume that a k-dimensional
vector of population means, denoted by $\bar{X} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k)$,
is known, that the vector $(y_i, x_{i1}, x_{i2}, \ldots, x_{ik})$
is observed for every unit in the sample and that an estimator
of the mean of y is desired. We assume that the first
element of $x_i$ is one for all i. Hence, the first element of $\bar{X}$
is also one. The vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$ is sometimes
called the vector of control variables. A regression estimator
of the mean of y is


$$\bar{y}_r = \bar{X}\hat{\beta}, \qquad (2.1)$$
where


$$\hat{\beta} = \Big(\sum_{i=1}^n x_i' \pi_i^{-1} x_i\Big)^{-1} \sum_{i=1}^n x_i' \pi_i^{-1} y_i, \qquad (2.2)$$


and we have assumed $\sum x_i' \pi_i^{-1} x_i$ to be nonsingular. This
definition of the regression estimator follows Mickey
(1959) who suggested restricting the term regression
estimator to estimators that are location and scale
invariant. The estimator (2.1) can also be written as


$$\bar{y}_r = \sum_{i=1}^n w_i y_i, \qquad (2.3)$$


where
$$w_i = \bar{X}\Big(\sum_{j=1}^n x_j' \pi_j^{-1} x_j\Big)^{-1} x_i' \pi_i^{-1}, \qquad (2.4)$$


and the weights have the property,
$$\sum_{i=1}^n w_i x_i = \bar{X}. \qquad (2.5)$$

The weights of expression (2.4) are relatively easy to 
compute, and once computed, can be used for the esti- 
mation of any y-variable. If the vector x; is replaced by 
the vector 
C25) CX Na te ee) 2) 


the estimator can be written in the form 


— A 


Dy Sad weck 1lZ ura ele ew) eee 6 x1) 


where Z = Ois the population mean of z;, Z, = X, — X, 


abi rte 
(Vr2r) = ( ») x") > Ti ' (VirZi) 


i=1 i=1 


and 
A te ~1 
B, = ys (Rj = Sq) a (BF 2 
j=l 


n 

s r,_ —1 
D ea = Zn) Tj Jie 
j=! 


In the form (2.7), $\bar{y}_r$ is the intercept in the regression of
y on z. Thus, the theory given by Fuller (1975) for regression
coefficients is applicable to the regression estimator
of the mean. If the population total of units is known and
denoted by N, the estimated population total is $N\bar{y}_r$.

By defining a sequence of populations and samples, it
is possible to show that the estimator (2.1) is a consistent
estimator of the mean of y. See, for example, Fuller (1975).
The estimator of the variance of the regression estimator
is a function of the joint probabilities. Consider a stratified
two-stage sample and replace our single subscript i with
the triple $\ell jt$. Then, omitting the finite correction term, a
variance estimator is


lé 
VGS Sue kee nee (eal eare 
t=1 


a) 
Wrildy + 4.) C8) 


J=1 
where


$$d_{\ell j\cdot} = \sum_{t=1}^{m_{\ell j}} w_{\ell jt}\,(y_{\ell jt} - x_{\ell jt}\hat{\beta}),$$


$n_\ell$ is the number of sample primary sampling units in
stratum $\ell$, $m_{\ell j}$ is the number of sample elements in primary
sampling unit j of stratum $\ell$, $\hat{\beta}$ is the vector of coefficients
defined in (2.2), n is the total number of elements in the
sample, and $w_{\ell jt}$ is the weight for element t in primary
sampling unit j of stratum $\ell$. The factor n - k is used by
analogy to the divisor for the unbiased estimator of the
error variance in ordinary regression. When the vector of
control variables is coded as in (2.6), the estimator (2.8)
is the estimated variance of the first element of $\hat{\beta}$, the
estimated intercept. The estimator (2.8) was suggested in
Hidiroglou, Fuller and Hickman (1976) and the consistency
of the estimator was established by Fuller (1975). Also see
Särndal, Swensson and Wretman (1989).

The estimators, constructed with weights (2.4), have 
good large sample properties. However, they may have 
undesirable behavior in small samples. Because the weights 
are linear functions of x;, it is possible for some of the 
weights to be negative. Negative weights make it possible 
for estimates of positive parameters to be negative. Early 
research on methods of constructing nonnegative regres- 
sion weights was conducted by Husain (1969). Huang 
(1978) designed a computer program to produce non- 
negative regression weights. Huang and Fuller (1978) 
described the weight generation procedure and showed 


Survey Methodology, June 1994 


that the large sample distribution of the modified estimator 
is the same as that of the ordinary regression estimator. 
Also see Goebel (1976). 
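
To make the possibility of negative weights concrete, the following sketch computes the weights (2.4) for the simplest case of an intercept plus one auxiliary variable, $x_i = (1, t_i)$. The data, the equal selection probabilities and the function name `regression_weights` are hypothetical; the 2 x 2 matrix inverse is written out by hand so that no external library is assumed.

```python
def regression_weights(t, pi, T_bar):
    """Weights (2.4) for x_i = (1, t_i) and known population means (1, T_bar)."""
    # Elements of M = sum_i x_i' pi_i^{-1} x_i, a symmetric 2 x 2 matrix.
    m00 = sum(1.0 / p for p in pi)
    m01 = sum(ti / p for ti, p in zip(t, pi))
    m11 = sum(ti * ti / p for ti, p in zip(t, pi))
    det = m00 * m11 - m01 * m01
    # Row vector (1, T_bar) M^{-1}.
    a0 = (m11 - T_bar * m01) / det
    a1 = (T_bar * m00 - m01) / det
    # w_i = (1, T_bar) M^{-1} x_i' / pi_i
    return [(a0 + a1 * ti) / p for ti, p in zip(t, pi)]


# Hypothetical sample: the sample mean of t is 4, the known population mean is 2.
t = [1.0, 2.0, 3.0, 4.0, 10.0]
pi = [0.1] * 5
w = regression_weights(t, pi, T_bar=2.0)
print(w)
print(sum(w), sum(wi * ti for wi, ti in zip(w, t)))
```

The weights reproduce the known means as required by (2.5), but the unit with the largest t receives a negative weight because the sample mean must be pulled down towards the population mean.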

The computer algorithm of Huang (1978) is an iterative
procedure based upon the ideas of generalized least
squares. The goal of the Huang algorithm is a set of
weights $w_i$, i = 1, 2, ..., n, satisfying (2.5) that do not
differ greatly from the initial weights, where difference is
a function of the initial weight. The Huang algorithm
attempts to compute weights $w_i$ satisfying


(1 — M) max w;7;7' < (1 + M) min w,7;"/, 


l<i<n l<isn 


where the parameter M, 0 < M < 1, is specified by the
user and is generally chosen in the interval [0.8, 1.0]. If
the first round regression weights defined by (2.4) do not
satisfy the requirements, a second round of regression
weights is computed. The second round weights are
weighted regression weights in which a control factor is
assigned to each observation. Small control factors are
assigned to observations with large or small first round
weights. Relatively large control factors are assigned to
observations with first round weights close to $\pi_i^{-1}$. The
second round regression weights are checked and if they
fail to satisfy the criteria, the control factors are modified,
and so on. The algorithm is given in the Appendix.

The control weighting used in the Huang algorithm has 
much in common with bounded-influence and robust 
regression methods. That is, in the final estimator, the 
contribution to the estimation of the slope vector is reduced 
for observations that are far from the mean. See Hampel 
(1978), Krasker (1980), and Mallows (1983). Recent 
research on this type of estimator for survey samples is that 
of Deville and Sarndal (1992), Akkerboom, Sikkel, and 
van Herk (1991), Hulliger (1993) and Singh (1993). 

It is not always possible to construct weights meeting
the criteria and also satisfying (2.5). For example, if all
of the observations on $x_{i2}$ exceed the mean, there is no set
of positive weights summing to one that also satisfy
$\sum_{i=1}^n x_{i2} w_i = \bar{X}_2$. Therefore, the weight generation
program will terminate if weights meeting the specified
criteria cannot be constructed after a specified number
of iterations.

In some situations it is desirable to restrict the weights 
to the nonnegative integers. This is true when estimates of 
totals are being constructed and the population contains 
well defined units, such as people. Nonnegative integer 
weights then provide more comfortable estimates, in that 
the estimates are physically attainable. Integer weights can 
be constructed so that no rounding is necessary when 
building tables. With such integer weights, all multiple way 
tables will automatically be internally consistent. 

The Huang program contains an option to round the 
real weights to integer weights in a manner that maintains 




the sum of the weights. After rounding, the equalities (2.5) 
will generally no longer hold exactly. We have found that 
by iterating the Huang algorithm using the first-round integer 
weights as initial weights, integer weights can be constructed 
such that there is a very modest deviation from equality 
for expression (2.5). Cox (1987), Cox and Ernst (1982), 
and Fagan, Greenberg and Hemmig (1988) discuss rounding. 


3. REGRESSION ESTIMATION WITH 
NONRESPONSE 


The early theoretical developments for regression esti- 
mation assumed the sample to be a probability sample 
from the population. However, it has long been recognized 
that regression estimation can be used to reduce the bias 
that arises from imperfections in the data collection pro- 
cedure. The most obvious of these imperfections is 
nonresponse. In all large samples of human subjects, some 
of the subjects fail to provide information. If the non- 
respondents differ from the respondents, direct estimates 
constructed from the respondents will be biased. Given 
auxiliary information, regression estimation provides a
method of reducing the bias. The degree to which the bias
is reduced depends upon the relationship between the 
control variables, the variables of interest, and the response 
probabilities. See Little and Rubin (1987) for a general 
discussion of these issues. 

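Let $\pi_i^*$ denote the inclusion probability equal to the
product of $\pi_i$ and the conditional probability of observing
the unit given that the unit is selected. Then


$$E\Big\{\sum_{i=1}^n x_i' \pi_i^{-1} x_i \,\Big|\, \mathcal{F}_N\Big\} = \sum_{i=1}^N x_i' \pi_i^{-1} \pi_i^* x_i \qquad (3.1)$$


and


$$E\Big\{\sum_{i=1}^n x_i' \pi_i^{-1} y_i \,\Big|\, \mathcal{F}_N\Big\} = \sum_{i=1}^N x_i' \pi_i^{-1} \pi_i^* y_i, \qquad (3.2)$$


where the expectations are conditional on the given finite
population $\mathcal{F}_N$, and n is the realized sample size. In the
case of nonresponse, the ratio $p_i = \pi_i^* \pi_i^{-1}$ is the response
probability for individual i. Therefore, under conditions
such as those used by Fuller (1975),


$$\operatorname*{plim}_{n \to \infty,\ N \to \infty} (\hat{\beta} - \gamma) = 0, \qquad (3.3)$$


where $\hat{\beta}$ is defined in (2.2) and


$$\gamma = \Big(\sum_{i=1}^N x_i' \pi_i^{-1} \pi_i^* x_i\Big)^{-1} \sum_{i=1}^N x_i' \pi_i^{-1} \pi_i^* y_i. \qquad (3.4)$$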
Then
$$\bar{Y} = \bar{X}\gamma + \bar{A}, \qquad (3.5)$$


where $\bar{A} = N^{-1}\sum_{i=1}^N a_i$ and $a_i = y_i - x_i\gamma$. Thus, the
regression estimator (2.1) will be a consistent estimator of
$\bar{Y}$ if $\operatorname{plim}_{N \to \infty} \bar{A} = 0$. The probability limit of $\bar{A}$ will be
zero if the finite population is a random sample from an
infinite population in which the linear model


yor x Biche vEbep es 0 


holds for all 7. 


The mean $\bar{A}$ is zero when $\pi_i^* = \pi_i$ for all i and an element
of $x_i$ is one for all i because then


$$\gamma = B = \Big(\sum_{i=1}^N x_i' x_i\Big)^{-1} \sum_{i=1}^N x_i' y_i \qquad (3.6)$$


and $\sum_{i=1}^N (y_i - x_i B) = 0$. A sufficient condition for $\bar{A}$
to be zero is the existence of a row vector c such that


$$c\,x_i' = \pi_i (\pi_i^*)^{-1} \qquad (3.7)$$
for i = 1, 2, ..., N. Thus, if the ratio of nominal probabilities
to true probabilities is a linear function of the
control variables, the regression estimator is a consistent
estimator of the mean of y, where the limit is for sequences
as defined in Fuller (1975). One way in which (3.7) can be
satisfied is for the elements of $x_i$ to be dummy variables
that define subgroups and for the response probabilities
to be constant in each subgroup. This situation is sometimes
described by saying that elements are missing at random
in each subgroup. We take the assumption that $\bar{A} = 0$ as
our working assumption in the empirical analysis.
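
The subgroup case can be illustrated with a small calculation. The sketch below assumes a hypothetical population of five units in two subgroups with a constant response probability within each subgroup; the function names are illustrative. When the control variables are subgroup indicators, the matrix in (3.4) is diagonal, so each element of $\gamma$ is a response-probability-weighted subgroup mean.

```python
def respondent_limit(y, p):
    """Limit of the unadjusted respondent mean: sum(p_i y_i) / sum(p_i)."""
    return sum(pi * yi for pi, yi in zip(p, y)) / sum(p)


def regression_limit(y, p, group):
    """Limit X-bar * gamma of the regression estimator with subgroup indicators.

    For indicator control variables, gamma_g = sum_{i in g} p_i y_i / sum_{i in g} p_i,
    and the known population mean of each indicator is the subgroup share N_g / N.
    """
    N = len(y)
    total = 0.0
    for g in sorted(set(group)):
        members = [i for i in range(N) if group[i] == g]
        gamma_g = sum(p[i] * y[i] for i in members) / sum(p[i] for i in members)
        total += (len(members) / N) * gamma_g
    return total


# Hypothetical population: subgroup 0 responds with probability 0.8, subgroup 1 with 0.4.
y = [8.0, 12.0, 10.0, 18.0, 22.0]
group = [0, 0, 0, 1, 1]
p = [0.8, 0.8, 0.8, 0.4, 0.4]

true_mean = sum(y) / len(y)
print(true_mean, respondent_limit(y, p), regression_limit(y, p, group))
```

The respondent mean is pulled towards the subgroup that responds more often, while the regression limit equals the population mean because the response probability is constant within each subgroup.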

In any regression problem, it is impossible to use the 
sample to verify some of the assumptions. For example, in 
ordinary least squares regression, the residuals $\hat{e}_i = y_i - x_i\hat{\beta}$
are uncorrelated with x; and, hence, the residuals cannot 
be used to check the assumption that the true errors are 
uncorrelated with x. Thus, in a survey with nonresponse, 
one searches for control variables that are correlated with 
y and (or) that one believes are correlated with the response 
probabilities. But one cannot guarantee that all bias has 
been removed by regression estimation based on a partic- 
ular set of control variables. 

In practice, one can often identify x-variables that are 
correlated with the probability of response and (or) corre- 
lated with the y-variables. For example, in the 1987-1988 
Nationwide Food Consumption Survey, the response rate 
was low among high-income households. Therefore, use 
of variables for household income in a regression 
estimator is expected to reduce the bias in the estimated 
mean for characteristics that are correlated with income. 

The error in $\hat{\beta}$ as an estimator of $\gamma$ can be approximated
by


n 
B ay (ESN PE Dy or ay ie 


i=] 


where $a_i$ is defined in (3.5),


N 

tik 

i= MD Tj Vr 
i=] 


and 
UNG 
Ge Tee Ds Xf mem N 
Under reasonable assumptions
and 


are consistent estimators of T and G. Thus, the variance
of the regression estimator can be estimated by estimating
the variance of $\sum_{i=1}^n x_i' \pi_i^{-1} a_i$. If we assume that the
conditional probabilities of response in one primary 
sampling unit are independent of those in all other primary 
sampling units and that at least one observation unit is 
observed in each selected primary sampling unit, then (2.8) 
remains an appropriate estimator of the variance of the 
regression estimated mean of y. 

The estimator of variance (2.8) also remains appropriate 
if the regression weights are constructed by a procedure 
other than (2.4). For example, let the weights be defined by 


n 21 
as =i =i 
Weis | ‘> Xj Tj a Xj Kin Sia 


i=1 


where the g; are functions of the x;. Assume 
plim 6, KS 
where 


n =| n 
By = yy si mai| >: xj mj! 81 Yi- 


i=1 i=1 


Also assume 


N 
plim N~' Se @yr sey gens 


i=1 


Then expression (2.8) with $w_{gi}$ replacing $w_i$ is a consistent
estimator of the variance of the estimator. The
estimator (2.8) will be used in our empirical analyses.
Formula (2.8) identifies the two effects of regression
estimation on the variance of an estimated mean. The 
correlation effect reduces the variance of the estimated 
mean while the increase in the sum of squares of the 
weights increases the variance of the estimated mean. To 
understand these effects, consider a simple random sample. 
If the y variable is correlated with x, the correlation tends 
to reduce the variance of the regression estimator relative 
to that of the simple estimator because 


ECG, — Xp) 1 = Ell yee 2 (y,) 1"). 


If the sample means of the control variables differ from
the population means, then


$$\sum_{i=1}^n w_i^2 > n^{-1},$$
where $n^{-1}$ is the sum of squares of the simple weights for
a simple random sample.
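
A short numerical illustration of this inequality, using a hypothetical set of five weights that sum to one (the values are illustrative and are not taken from the survey):

```python
# Hypothetical regression weights summing to one; equal weights would all be 0.2.
weights = [0.32, 0.28, 0.24, 0.20, -0.04]
n = len(weights)

sum_of_squares = sum(w * w for w in weights)
equal_weight_value = 1.0 / n   # sum of squares of the simple weights 1/n

print(sum_of_squares, equal_weight_value)
```

By the Cauchy-Schwarz inequality, any weights that sum to one satisfy $\sum w_i^2 \geq n^{-1}$, with equality only for equal weights, so any adjustment away from equal weights increases this factor.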

When comparing the variance of the sample mean with 
the variance of the regression estimator, one should not 
forget that one of the reasons for using regression esti- 
mation in samples with nonresponse is to produce an 
estimator with less bias than that of the direct estimator. 
Thus, in the next section we compare an estimator of the 
mean square error of the simple estimator to an estimator 
of the variance of the regression estimator. 


4. APPLICATION TO THE NATIONWIDE 
FOOD CONSUMPTION SURVEY 


The 1987-1988 Nationwide Food Consumption Survey 
was conducted by the Human Nutrition Information 
Service of the U.S. Department of Agriculture. The orig- 
inal sample was a self-weighting stratified sample of area 
primary sampling units within the 48 conterminous states. 
Primary sampling units were divided into secondary units 
called area segments. Households within the sample 
segments were contacted by personal interview. The field 
operation was conducted during the period April 1987 
through August 1988 by a contractor under contract to the 
Human Nutrition Information Service. 

Approximately 37% of the housing units identified as 
occupied provided complete household food use informa- 
tion. The realized household sample contains 4,495 house- 
holds. Because of the low response rate, the Human 
Nutrition Information Service decided to use regression 
weighting in the estimation. Population totals for all 
characteristics except urbanization were estimated by the 
Human Nutrition Information Service from the March 
1987 Current Population Survey. See Bureau of the Census 
(1987). The population totals for urbanization classes 
were furnished by the contractor. In our analysis, we treat 
the estimated population totals as if they were known 
population totals. 
Table 1
Sample and population characteristics of households

Household                                     Household   Household
Characteristic          Category              Sample      Sample     Population
                                              Frequency   Percent    Percent
Season                  Spring                  1,828       40.7       25.0
                        Summer                    678       15.1       25.0
                        Fall                      717       16.0       25.0
                        Winter                  1,272       28.3       25.0
Region                  Northeast                 905       20.1       21.2
                        Midwest                 1,172       26.1       24.7
                        South                   1,567       34.9       34.4
                        West                      851       18.9       19.6
Urbanization            Central Cities          1,064       23.7       31.2
                        Suburban                2,122       47.2       46.0
                        Nonmetro                1,309       29.1       22.9
Household Income        < 131%                  1,041       23.2       20.0
as % of Poverty         131-300%                1,564       34.8       32.2
                        301-500%                1,108       24.6       25.9
                        > 500%                    782       17.4       21.8
Household Food          Yes                       314        7.0        7.4
Stamps                  No                      4,181       93.0       92.6
Ownership of            Yes                     2,998       66.7       64.1
Domicile                No                      1,497       33.3       35.9
Race of Household       Black                     519       11.5       11.1
Head                    Nonblack                3,976       88.5       88.9
Age of Household        < 25                      338        7.5        7.9
Head                    25-39                   1,588       35.3       36.1
                        40-59                   1,369       30.5       30.5
                        60-69                     660       14.7       13.0
                        70+                       540       12.0       12.6
Household Head          Both Male and           3,057       68.0       60.8
Status                  Female
                        Female Only             1,044       23.2       26.0
                        Male Only                 394        8.8       13.2
Female Head             Yes                     1,792       39.9       41.5
Worked                  No                      2,703       60.1       58.5
Exactly One Adult       Yes                     1,211       26.9       29.7
in Household            No                      3,284       73.1       70.3
Exactly Two Adults      Yes                     2,616       58.2       54.2
in Household            No                      1,879       41.8       45.8
Presence of Child       Yes                     1,009       22.4       20.1
< 7 Years Old           No                      3,486       77.6       79.9
Presence of Child       Yes                     1,309       29.1       26.5
7-17 Years Old          No                      3,186       70.9       73.5
Household Size          (Mean)                              2.731      2.642
Household Size          (Mean)                              9.546      9.125
Squared


Characteristics of the population and of the household 
sample are given in Table 1. The original sample was 
unbalanced with respect to time of interview with nearly 
41% of the interviews in the spring quarter and about 16% 
of the interviews in each of the summer and fall quarters. 
Interviews for the spring and summer quarters were done 
in both 1987 and 1988. 

The sample was also unbalanced with respect to urban- 
ization. There was a lower fraction of central city house- 
holds in the sample than in the population (24% versus 
31%), and a higher fraction of nonmetropolitan households 
in the sample than in the population (29% versus 23%). 
The fraction of high income households was smaller in 
the sample than in the population. The sample contained 
a higher fraction of households with both a male and 
female head than the population (68% versus 61%). A 
higher fraction of the sample than of the population 
consisted of households with children. The sample was 
only mildly unbalanced with respect to several other socio- 
demographic characteristics. 

The characteristics listed in Table 1 are believed by the 
staff of the Human Nutrition Information Service to be 
related to food consumption behavior. Therefore, variables 
based on those characteristics were used in the regression 
weighting procedure. To implement the weight generation 
program, each of the categorical variables of Table 1 was 
converted to a set of indicator variables. For example, 
three variables were created for the characteristic, house- 
hold income as a percent of poverty. These were 


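$$Z_{t1} = \begin{cases} 1 & \text{if income is} < 131\% \text{ for the } t\text{-th household} \\ 0 & \text{otherwise,} \end{cases}$$

$$Z_{t2} = \begin{cases} 1 & \text{if income is 131-300\% for the } t\text{-th household} \\ 0 & \text{otherwise,} \end{cases}$$

$$Z_{t3} = \begin{cases} 1 & \text{if income is 301-500\% for the } t\text{-th household} \\ 0 & \text{otherwise.} \end{cases}$$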


Using this procedure, 25 indicator variables were created. 
In addition, household size and the square of household 
size were used as continuous variables. 
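
The coding described above can be sketched in a few lines. This is a minimal illustration only: the function name, the dictionary of category counts, and the treatment of boundary values (for example, exactly 300%) are assumptions made for the example and are not taken from the weight generation program. The count check assumes one indicator fewer than the number of categories for each characteristic, which agrees with the three income indicators and with the total of 25.

```python
def poverty_indicators(income_pct):
    """Return (Z1, Z2, Z3) for household income as a percent of poverty.

    The category above 500% is the omitted reference class, so a
    household in that class gets (0, 0, 0). Boundary handling is assumed.
    """
    z1 = 1 if income_pct < 131 else 0
    z2 = 1 if 131 <= income_pct <= 300 else 0
    z3 = 1 if 300 < income_pct <= 500 else 0
    return (z1, z2, z3)


# Number of categories for each categorical characteristic in Table 1
# (the dictionary keys are hypothetical labels).
CATEGORY_COUNTS = {
    "season": 4, "region": 4, "urbanization": 3, "income_pct_poverty": 4,
    "food_stamps": 2, "ownership": 2, "race_of_head": 2, "age_of_head": 5,
    "head_status": 3, "female_head_worked": 2, "one_adult": 2,
    "two_adults": 2, "child_under_7": 2, "child_7_17": 2,
}

# One indicator fewer than the number of categories per characteristic.
n_indicators = sum(k - 1 for k in CATEGORY_COUNTS.values())

# Adding household size and its square gives the full set of controls.
n_controls = n_indicators + 2
```
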

The twenty-seven variables were used to generate 
regression weights using Huang’s program. The parameter 
M of the weight generation program was set equal to 0.9 
in the computation. The weights were rounded to integers, 
where each integer weight is a weight in thousands. The 
sum of the weights is 88,942, which is the number of 
households in the population in thousands. The average 
weight is 19.787, the smallest weight is 6, and the largest 
weight is 47. Thus, the largest weight is 2.38 times the 
average weight. The sum of squares of the weights is 
2,317,930. The average weight squared and multiplied by 
the sample size is 1,759,884. Thus, if a variable has zero 
multiple correlation with the 27 variables, the variance of 
an estimate computed with the weights will be about 1.32 
times the variance of the simple unweighted estimator. 
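
The summary figures in this paragraph can be checked directly from the reported sums. The sketch below uses only the numbers quoted above; the variable names are chosen for the illustration.

```python
n = 4495                  # households in the realized sample
sum_w = 88942             # sum of the integer weights (thousands of households)
sum_w_sq = 2317930        # sum of squares of the weights
max_w = 47                # largest weight

mean_w = sum_w / n                        # average weight
max_ratio = max_w / mean_w                # largest weight relative to the average
equal_weight_ssq = n * mean_w ** 2        # sum of squares if all weights were equal

# Variance inflation for a variable uncorrelated with the control variables.
inflation = sum_w_sq / equal_weight_ssq
```

With these inputs, `mean_w` rounds to 19.787, `max_ratio` to 2.38, and `inflation` to 1.32, matching the values in the text.
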

Figure 1 shows the regression weights computed by 
the Huang algorithm plotted against the ordinary least 
squares weights. Because there are 4,495 households, 
many points are hidden. Both weights are standardized by 
dividing by the average weight. Thus, the average for 
weights coded in this manner is one. Because there are 
27 control variables used in the construction, the Huang 
weights tend to form a swarm of points about an S-shaped 
function of the original weights. If there were only one 
control variable, the points would fall on an S-shaped 
curve. The original weights for observations to the left of 
zero were negative. 


Figure 1. Plot of final weights against the ordinary least squares 
weights, both expressed relative to the average weight. 


To compare estimates constructed with weights to 
unweighted estimates, we use the variables 


$Y_1$ = adjusted total number of meals away from home 
(meals away), 

$Y_2$ = total money value of food used at home (home 
food), 

$Y_3$ = household size in 21-meal-equivalent persons (meal 
persons), 

$Y_4$ = indicator to identify housekeeping households 
(housekeeping). 


The household size in 21-meal-equivalent persons is the 
total adjusted meals eaten from household food supplies 
in the past 7 days divided by 21. ‘‘Meal persons’’ is the sum 
of two terms. The first term is the sum of the proportions 
of meals eaten at home in the interview week by each 
household member. The second term is the number of 
meals served to guests, boarders, and employees during 
the interview week, divided by 21. In other words: 


$$\text{meal persons for } j\text{-th household} = \sum_{i} (h_{ij} + a_{ij})^{-1} h_{ij} + (21)^{-1} b_j ,$$

where $h_{ij}$ = meals eaten at home by the $i$-th individual in 
the $j$-th household during the interview week, $a_{ij}$ = meals 
eaten away from home by the $i$-th individual in the $j$-th 
household during the interview week, and $b_j$ = number of 
meals eaten by nonhousehold members in the $j$-th household 
during the interview week. 

The adjusted total number of meals bought and eaten 
away from home is the sum of the proportions of meals 
eaten away from home in the interview week by household 
members, multiplied by 21. In the notation used to define 
meal persons, 
$$\text{meals away for } j\text{-th household} = 21 \sum_{i} (h_{ij} + a_{ij})^{-1} a_{ij} .$$
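
As an illustration of the two definitions, the following sketch computes meal persons and meals away for one household. The function names and the input layout (a list of per-member pairs of meals at home and meals away) are assumptions made for the example; members with no reported meals are skipped to avoid division by zero, which is also an assumption.

```python
def meal_persons(members, guest_meals):
    """Household size in 21-meal-equivalent persons.

    members: list of (h, a) pairs, meals eaten at home and away
             by each household member during the interview week.
    guest_meals: meals served to nonhousehold members (b_j).
    """
    home_share = sum(h / (h + a) for h, a in members if h + a > 0)
    return home_share + guest_meals / 21


def meals_away(members):
    """Adjusted total number of meals eaten away from home."""
    return 21 * sum(a / (h + a) for h, a in members if h + a > 0)


# Hypothetical household: one member eats 14 of 21 meals at home,
# another eats all 21 at home, and 3 meals are served to guests.
household = [(14, 7), (21, 0)]
mp = meal_persons(household, guest_meals=3)   # 14/21 + 1 + 3/21
ma = meals_away(household)                    # 21 * (7/21)
```
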


The total value of food used at home is the expenditures 
for purchased food plus the money value of home-produced 
food and food received free-of-cost that was used during 
the survey week. Expenditures for purchased food were 
based on prices reported as paid regardless of the time of 
purchase. Sales tax was excluded. Purchased food with 
unreported prices, food produced at home, food received 
as a gift, and food received instead of pay were valued at 
the average price per pound paid for comparable food by 
survey households in the same region and season. 

A housekeeping household is a household with at least 
one person having ten or more adjusted meals from the 
household food supply during the seven days before the 
interview. Household food-use analyses generally consider 
only housekeeping households. 


Table 2 
Properties of alternative estimators 

                                                             Relative
                       Unweighted   Weighted                 Efficiency of
Variable               Mean         Mean       Difference    Regression

Meals away 
  Housekeeping            7.75         7.93       -0.18          2.52
                         (0.22)       (0.17)      (0.09)
  Nonhousekeeping        18.31        18.12        0.19          0.92
                         (1.14)       (1.19)      (0.68)
  All                     8.27         8.57       -0.30          2.56
                         (0.22)       (0.22)      (0.12)
Home food 
  Housekeeping           61.10        59.56        1.54          3.65
                         (1.14)       (0.98)      (0.41)
  Nonhousekeeping        25.99        26.39       -0.40          0.73
                         (1.25)       (1.46)      (1.00)
  All                    59.37        57.49        1.88          5.60
                         (1.12)       (0.91)      (0.39)
Meal persons 
  Housekeeping            2.42         2.33        0.09         89.00
                         (0.03)       (0.01)      (0.01)
  Nonhousekeeping         0.51         0.49        0.02          1.00
                         (0.03)       (0.03)      (0.02)
  All                     2.33         2.22        0.11        129.00
                         (0.03)       (0.01)      (0.01)
Housekeeping (%)         95.06        93.77        1.29          5.30
                         (0.40)       (0.58)      (0.10)


The means of the variables computed using unweighted 
data are given in Table 2 in the column headed, ‘‘Un- 
weighted mean’’. Three means are given for meals away, 
home food, and meal persons. Two means are computed 
for the two subpopulations defined by the housekeeping 
variables. The third mean, designated by ‘‘all’’ in the table, 
is the estimated mean for the entire population. The stan- 
dard errors of the estimates are given in parentheses below 
the estimates. The estimates and standard errors for the 
unweighted estimates were computed in PC CARP. See 
Fuller et al. (1986). The computations accounted for the 
fact that the sample is an area stratified cluster sample. 

Because the sample is a two-stage sample, the estimated 
variances are larger than the variance of a simple random 
sample containing the same number of households. The 
ratio of the variance for a sample estimate to the variance 
of a simple random sample containing the same number 
of individuals is called the design effect. The estimated 
design effect is about 2.5 for meals away and meal persons, 
is about 4.1 for home food, and is about 1.5 for house- 
keeping for the ‘‘all’’ means for the unweighted sample. 
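
For a proportion, the design effect can be approximated from the reported standard error alone, because the simple random sampling variance is p(1 - p)/n. The sketch below applies this to the unweighted housekeeping percentage of Table 2; the function name is an assumption, the finite population correction is ignored, and the inputs are the rounded table values, so the result is only approximate.

```python
def design_effect_proportion(p, se_design, n):
    """Ratio of the design-based variance to the simple random sampling variance."""
    var_srs = p * (1 - p) / n      # SRS variance of a sample proportion
    return se_design ** 2 / var_srs


# Unweighted housekeeping estimate: 95.06% with standard error 0.40%.
deff_housekeeping = design_effect_proportion(p=0.9506, se_design=0.0040, n=4495)
```

The result is close to the value of about 1.5 quoted in the text for housekeeping.
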

The column headed "Weighted mean" contains the esti- 
mates computed with the regression weights. The standard 
errors were computed in PC CARP using formula (2.8) 
with the regression weights replacing the $\pi_t^{-1}$. The variance 
calculation requires computing a regression for every 
y-variable. The estimated means for the subpopulations 
are ratios of regression estimators. The variances for the 
subpopulation means were computed by calculating the 
variances of the Taylor deviates for the ratio using formula 
(2.8). The standard errors for unweighted and weighted 
estimates are similar for meals away and home food. How- 
ever, the standard error for the regression estimate of the 
population mean of meal persons is about one third of the 
standard error of the unweighted estimate. The standard 
error of the regression estimator is smaller because meal 
persons is highly correlated with the household size vari- 
ables used as controls in the regression procedure. 

The estimated squared multiple correlation between the 
variables of the table and the 27 control variables is 0.29, 
0.44, 0.82, and 0.12 for meals away, home food, meal 
persons, and housekeeping, respectively. If the sample 
means of the control variables were nearly equal to the popu- 
lation means, the standard error of the regression estimate 
of meals away would be about $(1 - 0.29)^{1/2} = 0.84$ times 
the standard error of the unweighted estimate. In fact, the 
estimated standard error of the regression estimate is about 0.97 
times the standard error of the unweighted estimate. The 
difference is due to the fact that $\sum_{t=1}^{n} w_t^2$ is considerably 
bigger than $n^{-1}$ because the sample is unbalanced on a 
number of items. Note that 

$$0.97 = [(0.71)(1.32)]^{1/2},$$

where $0.71 = (1 - 0.29)$ is one minus the squared corre- 
lation and $1.32 = n \sum_{t=1}^{n} w_t^2$. The situation for house- 
keeping is more extreme. The correlation is not large, and, 
apparently, the large deviations from the regression line 
are associated with large weights. The estimated variance 
for the regression estimator is about twice the estimated 
variance of the unweighted estimator. 
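
The arithmetic behind the two ratios for meals away can be reproduced as follows. The sketch uses only the squared multiple correlation of 0.29 and the weight inflation factor of 1.32 reported in the text; the variable names are illustrative.

```python
import math

r_squared = 0.29         # squared multiple correlation for meals away
weight_inflation = 1.32  # n times the sum of squared weights (weights sum to one)

# Ratio of standard errors if the sample were balanced on the controls.
balanced_ratio = math.sqrt(1 - r_squared)

# Ratio of standard errors after allowing for the unequal weights.
realized_ratio = math.sqrt((1 - r_squared) * weight_inflation)
```

Here `balanced_ratio` rounds to 0.84 and `realized_ratio` to 0.97.
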
Table 2 also contains the estimated differences between 
the unweighted and weighted estimators. The difference 
between the unweighted and the weighted estimated total is 

$$\sum_{t=1}^{n} \bar{w}\, y_t - \sum_{t=1}^{n} w_t y_t = -\sum_{t=1}^{n} (w_t - \bar{w})\, y_t ,$$

where $\bar{w}$ is the average of the weights $w_t$. 


The difference between the estimated means is the differ- 
ence between the totals divided by the population size. 
To compute the variance of the difference between the 
means, we note that the hypothesis of a zero difference is 
equivalent to the hypothesis that the correlation between 
$w_t$ and $y_t$ is zero. Therefore, we computed the unweighted 
regression of $y_t$ on $w_t$ and computed the variance of the 
regression coefficient under the design using PC CARP. 
The standard errors for the difference in Table 2 are such 
that the "t-statistic" for the hypothesis of zero difference 
is equal to the "t-statistic" for the coefficient of $w_t$ in the 
regression of $y_t$ on $w_t$. 
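
The link between a zero difference and a zero correlation rests on a simple algebraic identity: the unweighted mean minus the weighted mean equals minus the sum of cross products of the centered weights and centered observations, divided by the sum of the weights. The sketch below checks the identity numerically on made-up data; the function name and the data are assumptions for illustration, and the design-based variance computation performed in PC CARP is not reproduced.

```python
def mean_difference(y, w):
    """Return the unweighted-minus-weighted mean computed two ways."""
    n = len(y)
    total_w = sum(w)                       # plays the role of the population size
    y_bar = sum(y) / n
    w_bar = total_w / n
    weighted_mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    direct = y_bar - weighted_mean
    # Same quantity written through the cross products of w and y.
    cross = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y))
    via_cross_products = -cross / total_w
    return direct, via_cross_products


direct, via_cross = mean_difference([2.0, 5.0, 3.0, 8.0], [10, 20, 30, 40])
```

When the weights are uncorrelated with the observations, the cross-product sum is zero and the two means coincide.
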

For all four characteristics, the difference between the 
weighted and unweighted estimators of the population 
mean is significant at traditional levels. Thus, under the 
assumption that the regression estimators are unbiased, 
there are significant biases in the unweighted estimators. 

The bias picture is mixed for the estimates of the sub- 
population means. The difference between the two esti- 
mators is significant for the three means for the house- 
keeping subpopulation, which is the population of interest. 
The difference is nonsignificant for the three means for 
the nonhousekeeping subpopulation. The sample contains 
only 222 nonhousekeeping households. Therefore, the 
variance of the difference between the weighted and un- 
weighted estimates is much larger for the nonhousekeeping 
households than for the housekeeping households. 

The differences between the two estimates of the popu- 
lation means are a function of the differences between the 
two estimates of the subpopulation means and the two 
estimates of the fraction of households in the two categories. 
This explains why the difference for ‘‘all’’ can be larger 
than both the ‘‘housekeeping’’ and ‘‘nonhousekeeping’’ 
differences. 
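
The meals away entries of Table 2 illustrate the point. The overall mean is the housekeeping fraction times the housekeeping mean plus the remaining fraction times the nonhousekeeping mean. In the sketch below the unweighted housekeeping mean is obtained as the weighted mean plus the tabled difference (7.93 - 0.18); the function name is an assumption, and because the inputs are rounded the results agree with the table only to about 0.01.

```python
def overall_mean(p_housekeeping, mean_housekeeping, mean_nonhousekeeping):
    """Population mean as a mixture of the two subpopulation means."""
    return (p_housekeeping * mean_housekeeping
            + (1 - p_housekeeping) * mean_nonhousekeeping)


# Meals away: unweighted and weighted components from Table 2.
unweighted_all = overall_mean(0.9506, 7.93 - 0.18, 18.31)
weighted_all = overall_mean(0.9377, 7.93, 18.12)
diff_all = unweighted_all - weighted_all

# Subpopulation differences reported in Table 2.
diff_housekeeping = -0.18
diff_nonhousekeeping = 0.19
```

The overall difference is close to -0.30, larger in absolute value than either subpopulation difference, because the estimated housekeeping fraction also changes between the two estimators.
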

The last column of Table 2 contains the ratio of the 
estimated mean square error of the unweighted estimator 
to the variance of the regression estimator. The estimated 
mean square errors for the unweighted estimators were 
computed as 


$$\mathrm{MSE}_u = V + \max\{0,\ (\mathrm{Diff})^2 - (\mathrm{s.e.\ diff})^2\},$$


where V is the estimated variance of the unweighted esti- 
mate, Diff is the difference between the two estimates from 
Table 2, and s.e. diff is the standard error of the difference 
from Table 2. The estimated variance V for the unweighted 
estimator is variance formula (2.8) with constant $w_t$, 
and with $x_t \hat{\beta}$ replaced by $\bar{y}_n$. The second term of the 
estimated mean square error is the estimated squared bias. 
Under the assumption that the regression estimator is 
unbiased, the expected value of $(\mathrm{Diff})^2$ is the squared bias 
plus the variance of the difference. Hence, under the 
assumption that the regression estimator is unbiased, the 
estimated mean square error of the unweighted estimator 
is a consistent estimator. The estimated mean square errors 
of the weighted estimators are the variances of the weighted 
estimators computed as the squares of the standard errors 
of Table 2. 
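
The last column of Table 2 can be reproduced from the other columns. The following sketch implements the mean square error formula above and applies it to several rows; the function name is an assumption, and the inputs are the rounded table entries.

```python
def relative_efficiency(se_unweighted, se_weighted, diff, se_diff):
    """Estimated MSE of the unweighted estimator over the regression variance."""
    variance_unweighted = se_unweighted ** 2
    # Estimated squared bias, truncated at zero.
    squared_bias = max(0.0, diff ** 2 - se_diff ** 2)
    mse_unweighted = variance_unweighted + squared_bias
    return mse_unweighted / se_weighted ** 2


# Rows of Table 2: (s.e. unweighted, s.e. weighted, difference, s.e. difference).
meals_away_all = relative_efficiency(0.22, 0.22, -0.30, 0.12)
home_food_all = relative_efficiency(1.12, 0.91, 1.88, 0.39)
meals_away_nonhousekeeping = relative_efficiency(1.14, 1.19, 0.19, 0.68)
meal_persons_all = relative_efficiency(0.03, 0.01, 0.11, 0.01)
```

For the nonhousekeeping row the estimated squared bias is truncated to zero, so the ratio reduces to the ratio of the two variances.
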

Of the four characteristics for which the population 
mean was estimated, the estimated relative efficiency of 
the regression estimator to the simple mean ranges from 
2.5 to 129. The regression estimator for meals away has 
the smallest estimated efficiency. The variances of the two 
estimators are similar, but because of the estimated bias, 
the regression estimate for meals away is estimated to have 
a mean square error that is about 40% of that of the un- 
weighted estimate. The mean square error of the regression 
estimate for home food is less than 20% of that of the 
unweighted estimate, that for meal persons is about 1% 
of that of the unweighted estimate, and that for house- 
keeping is about 20% that of the unweighted estimator. 
In all cases, the squared bias is a very important component 
of the estimated mean square error. 

Because the unweighted subpopulation estimates for 
the nonhousekeeping households showed little bias, the 
unweighted estimates are estimated to be somewhat more 
efficient than the regression estimates. The nonhouse- 
keeping subpopulation is only about 6% of the population 
and the deviations from the subpopulation mean show 
little correlation with the control variables. On the other 
hand, the regression estimates for the housekeeping sub- 
population are estimated to be much more efficient than 
the unweighted estimates. The relative efficiencies for the 
housekeeping subpopulation are close to those of the total 
population estimates. 

Even after allowing for the fact that the population totals 
from the Current Population Survey are not known popula- 
tion totals, it is clear that large gains are associated with 
regression estimation for the population means. Although 
the regression estimator for the means of the small sub- 
population is estimated to be less efficient than the un- 
weighted estimators, the loss in efficiency is small relative to 
the large gains in efficiency estimated for the other variables. 
APPENDIX 
WEIGHT GENERATION PROGRAM 


In this appendix, we present the regression weight 
generation procedure of Huang and Fuller (1978). The 
procedure we describe contains the option of specifying 
maximum and minimum weights. This option was not part 
of the original program. For a discussion of related weight 
generation procedures, see Singh (1993). 

Suppose that the population means $(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k)$ 
of the $k$ auxiliary variables $(X_1, X_2, \ldots, X_k)$ are known. 
Let a sample of $n$ observations be available and let 

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{ik}), \tag{A.1}$$

where $X_{ij}$ is the observation on variable $j$ for individual $i$. 


In addition to the array of sample observations and the 
population means, two initial factors $v_i$ and $g_i^{(0)}$, $i = 1$, 
$2, \ldots, n$, are required to initiate the computations. The 
$v_i$ are typically inversely proportional to the probabilities 
of selection. The default values for $g_i^{(0)}$ are $g_i^{(0)} = 1$. For 
stratified samples or data with unequal variances, the user 
may choose other values for $g_i^{(0)}$. (See Huang 1978 or 
Goebel 1976.) The program input includes the sample size 
$n$, the population size $N$, the parameter $M$, the maximum 
number of iterations LI, the upper bound of the ratios of 
weights to the average weight $U_B$, and the lower bound 
of the ratios of weights to the average weight $L_B$. It is 
required that $0 < L_B < 1 < U_B$. In our description, we 
assume $\sum_{i=1}^{n} v_i = n$. The program normalizes the $v_i$ so 
that the sum is $n$. 

The program can be used to construct weights to 
estimate means or to estimate totals. The weights for totals 
are the weights for the means multiplied by $N$. For means, 
the program attempts to construct weights $w_i$ such that 


$$\sum_{i=1}^{n} w_i (1, X_i) = (1, \bar{X}), \tag{A.2}$$

$$L_B \le n w_i \le U_B, \tag{A.3}$$

$$(1 - M) \max_{1 \le i \le n} w_i v_i \le (1 + M) \min_{1 \le i \le n} w_i v_i, \tag{A.4}$$


Otel) Aes Ie 


The program is iterative, where an iteration consists of 
computing the generalized least squares weights, where a 
control factor $h_i$ is applied to each observation. The $h_i$ is 
a product of $v_i$ and $g_i$, where $g_i$ for iterations after the 
first is a "bell" shaped function of the distance (in a 
suitable metric) that the observation is from the population 
mean. At each iteration, the weights satisfy (A.2) but may 
fail (A.3) or (A.4). 
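
To make the last point concrete, the sketch below computes generalized least squares weights for a single auxiliary variable and then checks the calibration condition (A.2) and the bounds (A.3). The closed form used here is the standard weighted regression weight for one x-variable and is an assumption of this example; it is not a transcription of the program's expression (A.5). The function names, the data, and the bounds 0.5 and 2.0 are hypothetical.

```python
def gls_weights_single_x(x, h, x_pop_mean):
    """Regression weights for a mean with one auxiliary variable.

    h holds the control factors; the weights sum to one and
    reproduce the known population mean of x.
    """
    sum_h = sum(h)
    x_h = sum(hi * xi for hi, xi in zip(h, x)) / sum_h       # h-weighted sample mean
    ssq = sum(hi * (xi - x_h) ** 2 for hi, xi in zip(h, x))  # h-weighted sum of squares
    slope = (x_pop_mean - x_h) / ssq
    return [hi / sum_h + slope * hi * (xi - x_h) for hi, xi in zip(h, x)]


def check_weights(w, x, x_pop_mean, lower, upper, tol=1e-9):
    """Return (calibrated, bounded) for conditions (A.2) and (A.3)."""
    n = len(w)
    calibrated = (abs(sum(w) - 1.0) < tol
                  and abs(sum(wi * xi for wi, xi in zip(w, x)) - x_pop_mean) < tol)
    bounded = all(lower <= n * wi <= upper for wi in w)
    return calibrated, bounded


# Hypothetical sample with one outlying observation and equal control factors.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
w = gls_weights_single_x(x, h=[1.0] * 5, x_pop_mean=3.0)
calibrated, bounded = check_weights(w, x, 3.0, lower=0.5, upper=2.0)
```

In this example the first-iteration weights are calibrated but the weight on the outlying observation is 0.4 times the average weight, below the assumed lower bound, so a further iteration with a reduced control factor for that observation would be needed.
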

It will not always be possible to construct weights satis- 
fying the specified restrictions in the specified number of 
iterations. If the sample is such that the restriction cannot 
be met, the program outputs the weights of the last itera- 
tion. In the single x case, when the criterion cannot be 
satisfied, there will be two weights, one for those greater 
than the population mean, and one for those less than the 
population mean. 


To describe the algorithm, let 


$$Z_{ij} = X_{ij} - \bar{X}_j,$$
Zi Zi eee Zip 
Ze ee oh: 
Fae PLAN ican Ss 


IS iy, Sys: 
$$A^{(a)} = Z' H^{(a)} Z,$$
$$G^{(a)} = \mathrm{diag}(g_1^{(a)}, g_2^{(a)}, \ldots, g_n^{(a)}),$$
and 
HOG 


The algorithm consists of iterating three steps. 


1. The initial calculation is for $a = 0$. At iteration $a$, the 
vector of regression weights, denoted by $w^{(a)}$, is 


Wao ec ook Laka pidy elvis ding uh teatloce ) 
SG UN 


where 


a Ge Kee Xe Ug rg tl 


n Sy ti 
#=(L 8) Dv 


i=] i=1 


$(A^{(a)})^{\dagger}$ is a symmetric generalized inverse of $A^{(a)}$, 
nuk = max{J/ Vu, n—' — 1}, (A.6) 


and 


$$A^{(a)} = Z' H^{(a)} Z.$$
2. The weights of Step 1 are checked to see if they satisfy 
the criteria. 

(a) Is $|n u_i^{(a)}| < M$ for all $i$? 
(b) Is $L_B \le n w_i^{(a)} \le U_B$ for all $i$? 

If either (a) or (b) fails for any $i$ and LI iterations 
have not been completed, go to Step 3. If (a) and (b) 
are satisfied, or if LI iterations have been completed, 
the weights are output. 


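3. The control factors $h_i^{(a)}$, $i = 1, 2, \ldots, n$, are modified. 
Set 

$$H^{(a)} = H^{(a-1)} G^{(a)},$$

where 

$$G^{(a)} = \mathrm{diag}(g_1^{(a)}, g_2^{(a)}, \ldots, g_n^{(a)}),$$

$$g_i^{(a)} = \begin{cases} 1 & 0 \le d_i^{(a)} \le 0.5 \\ 1 - 0.8\,(d_i^{(a)} - 0.5)^2 & 0.5 < d_i^{(a)} \le 1 \\ 0.8\,(d_i^{(a)})^{-1} & d_i^{(a)} > 1, \end{cases}$$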


di” = L3espperb4 —ly | Teo) is 


Dee) = min(M,cie-} if wf") < y, 


= min{M,CG—")} ii wies)) =e; 


Cie eanmax (| We uGlet ier) Loe al Oui 
Cer maxi ly (ae) Cen OLIns ai 


Go to Step 1 to compute new regression weights. 


The constant 1.33 in the definition of $d^{(a)}$ and the con- 
stant of 0.8 in the definition of $g^{(a)}$ were chosen to speed 
convergence. The control factors $g_i^{(a)}$ are chosen to 
downweight observations on the basis of a distance from 
the population mean. 
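
The shape of the downweighting function can be sketched as follows. The piecewise form is read from the definition of the control factor in Step 3 (constant at one for small distances, a quadratic decline between 0.5 and 1, and a reciprocal decline beyond 1); the function name is an assumption, and the computation of the scaled distance itself is not reproduced here.

```python
def control_factor(d):
    """Multiplier applied to an observation's control factor, given its scaled distance d >= 0."""
    if d <= 0.5:
        return 1.0                           # near the mean: no downweighting
    if d <= 1.0:
        return 1.0 - 0.8 * (d - 0.5) ** 2    # smooth decline from 1.0 to 0.8
    return 0.8 / d                           # far from the mean: reciprocal decline


# The three pieces join continuously at d = 0.5 and d = 1.
values = [control_factor(d) for d in (0.0, 0.5, 0.75, 1.0, 2.0, 4.0)]
```

Because the multiplier never exceeds one and decreases with distance, repeated application shrinks the control factors of outlying observations while leaving those near the population mean unchanged.
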


The definition of $w^{(a)}$ in (A.5) is an alternative way of 
writing the vector of generalized least squares weights of 
(2.4) when $\pi_i^{-1} = h_i^{(a)}$. 
Estimating the Rate of Rural Homelessness: 
A Study of Nonurban Ohio 


ELIZABETH A. STASNY, BEVERLY G. TOOMEY and RICHARD J. FIRST¹ 


ABSTRACT 


Recently, much effort has been directed towards counting and characterizing the homeless. Most of this work, 
however, has focused on homeless persons in urban areas. In this paper, we describe efforts to estimate the rate 
of homelessness in nonurban counties in Ohio. The methods for locating homeless persons and even the definition 
of homelessness are different in rural areas where there are fewer institutions for sheltering and feeding the homeless. 
There may also be a problem with using standard survey sampling estimators, which typically require large population 
sizes, large sample sizes, and small sampling fractions. We describe a survey of homeless persons in nonurban Ohio 
and present a simulation study to assess the usefulness of standard estimators for a population proportion from 
a stratified cluster sample. 


KEY WORDS: Biased estimator; Regression estimator; Small sample size; Stratified cluster sample; Simulation. 


1. INTRODUCTION 


When we think of the homeless, we often think of 
"street people" and "bag ladies". We picture people 
sleeping on park benches, on heating grates, and in home- 
less shelters. These stereotypes of the homeless originated 
in large cities, however, and do not necessarily provide an 
accurate picture of homeless persons in rural areas. 

Many of the studies of homeless persons have been 
carried out in larger cities. For example, the 1987 Urban 
Institute Study counted homeless persons in 20 major cities 
in the U.S. Another major study by Rossi was carried out 
in Chicago. (See Burt and Taeuber (1991) for an overview 
of survey methods for these and other studies that counted 
homeless populations.) 

During the 1990 United States Population Census, a 
special attempt was made to include homeless persons in 
the population count through the S-Night (Shelter and 
Street Night) count. For this effort, a special national 
list of shelters and locations in which homeless persons 
sleep was compiled. The highest elected official of over 
39,000 rural and urban local governments was asked to 
provide a list of shelters, street locations, and open public 
locations where the homeless stay at night. The homeless 
were counted by Census enumerators during a single night, 
March 20. Note that the main goal of S-Night was to 
include homeless persons in the Census count; relatively 
little information on characteristics of the homeless is 
available in the Census data. Details on the S-Night 
procedures are provided by Taeuber and Siegel (1990). 

In contrast to surveys of homeless persons in urban 
areas and to the Census S-Night, the goal of the survey 
described here was to locate and count the nonurban 
homeless wherever they might be, and to collect information 
to describe these homeless persons. In Section 2 of this 
paper, we describe the design of the 1990 survey of rural 
homeless persons in Ohio. We present our definition of 
rural homelessness and we describe the methods used to 
locate and survey the homeless. In Section 3 we present 
our estimates of the rates of rural homelessness obtained 
using the standard estimator of a proportion from a 
stratified cluster sample. Since these estimates are likely 
to be biased, we also present the results of a simulation 
study conducted to assess the likely size of the bias. In 
Section 4 we consider a regression estimator for the rate 
of homelessness and compare the regression estimator to 
the standard estimator of Section 3. In Section 5 we present 
our conclusions. 


2. THE SURVEY 


There are 88 counties in Ohio. Of these, 13 are urban 
counties with large cities and 75 are defined as rural or 
nonurban. These 75 counties of interest include counties 
that are completely rural, counties that are not adjacent 
to urban counties and that have moderately populated 
county seats, and suburban counties that border counties 
with large metropolitan areas. 

The design used in this 1990 survey was selected to 
facilitate comparisons with a 1984 study of Ohio rural 
homeless persons (Roth et al. 1985). In the earlier study, 
Ohio's counties were divided into five regions, northeast, 
northwest, central, southeast, and southwest, and a 
stratified random sample of 16 rural counties was selected. 
The 21 counties selected for the 1990 study included the 
16 counties from the original study and one additional 
county selected at random from within each region. (We 
should note that analysis of data from the present study 
suggests that stratification of Ohio into the five regions 
is not useful for improving the estimate of rural 
homelessness.) A map of Ohio showing the five regions, the 
urban counties, and the sampled counties is provided in 
Figure 1. 


Figure 1. County Map of Ohio 


The following is a brief description of the 1990 survey 
methodology. More detailed descriptions are given by First 
et al. (1994) and Toomey et al. (1993). 


2.1 Survey Personnel 


A census of all homeless persons within the 21 sampled 
counties was attempted. Because there are not typically 
homeless shelters or other gathering places for the homeless 
in nonurban areas, the survey was conducted over a six- 
month period and made use of a network of advisors to 
locate the homeless. The survey period began with the first 
full week of February 1990. Homeless persons were iden- 
tified and located by a referral network within each 
sampled county. Each network was supervised by a local 
county coordinator. The principal investigators supervised 
the county coordinators and the central office staff. They 
monitored the data collection, through bi-weekly phone 
calls and field visits, to assure uniformity and to control 
quality. 

Advisors and interviewers, selected for their knowledge 
of the counties in which they worked, identified people 
who met the criteria for homelessness. Advisors included 
church leaders, hospital staff, civic club leaders, elected 
community officials, informal community leaders, bar- 
tenders, hotel clerks, laundromat attendants, and profes- 
sional service providers such as health department staff, 
librarians, agricultural extension agents, postal workers, 
ministers, park rangers, neighborhood action groups, 
human service case workers, mental health workers, and 
law enforcement officers. One hundred interviewers 
conducted the interviews with the homeless. Interviewers 
attended a four-hour training session and were provided 
with a training manual of field guidelines. Interviews took 
place in offices, diners, motel rooms, cars, state parks, 
barns, laundromats, bars, and under railroad trestles. 
Interviewers were trained to know about available commu- 
nity resources and to make referrals for respondents who 
wanted services. In addition, interviewers had access to 
funds to offer a meal or minor assistance if necessary 
(less than $600 was spent on such purchases). Assistance 
provided through interviewers was limited so that people 
would not have an incentive to falsely identify themselves 
as homeless. 


2.2 Definition of Rural Homeless 


Screening questions were used to identify homeless 
persons. The definition of homelessness used in this study 
was necessarily somewhat different from the definition 
used for studies in urban areas. In rural areas there are 
fewer public shelters and housing alternatives specifically 
for the homeless. Respondents were counted as homeless 
if they did not have a permanent residence they considered 
home and if, on the previous night, they had slept in 
(1) limited or no shelter, (2) shelters or missions that serve 
homeless persons, (3) cheap hotels or motels when the 
actual stay or intent to stay was 45 days or less, or (4) other 
unique situations when the actual stay or intent to stay was 
45 days or less. Included in the fourth category were people 
who stayed in sheds, barns, old buses, and old trailers 
without water or power, provided the person did not own 
the property and was not paying rent to stay there. Also 
included as homeless were people who were temporarily 
staying with friends or relatives, had not been staying in 
that household more than 45 days, were not a part of the 
household, and were planning on moving out in 45 days 
or less. Persons who were staying in battered women’s 
shelters, hospitals, prisons, migrant workers camps, etc. 
were not counted as homeless unless they were leaving the 
facility and had nowhere to go. 
Our definition of homelessness may be contrasted with 
that used in studies of homeless persons in urban areas. 
The definition of homelessness commonly used in 
such studies is based on the Stewart B. McKinney Homeless 
Assistance Act (1987). The Act defines a homeless person 
as ‘‘an individual who lacks a fixed, regular, and adequate 
nighttime residence and an individual who has a primary 
nighttime residence that is (a) a supervised publicly or 
privately operated shelter designed to provide temporary 
living accommodations (including welfare hotels, congre- 
gate shelters, and transitional housing for the mentally ill); 
(b) an institution that provides a temporary residence for 
individuals intended to be institutionalized; or (c) a public 
or private place not designed for, or ordinarily used as, a 
regular sleeping accommodation for human beings.’’ From 
this definition comes the notion of "literally homeless" 
as suggested by Rossi et al. (1987). These standard defini- 
tions do not include, for example, those homeless persons 
who double up with family or friends. We did include such 
persons in our count of the rural homeless. Our analysis 
indicates that about a third of the persons counted in our 
census would not be counted under the urban definition 
of homelessness. It is not known how much counting those 
doubling up would increase estimates in urban areas. 


2.3 The Interview Period 


The use of a six-month survey period for counting the 
rural homeless is different from the typical one-day survey 
period used most often in surveys conducted in urban 
areas. In a review of seven studies of the homeless, Burt 
and Taeuber (1991) report that these studies used single 
nights, or one or two weeks as the interview period at a 
single location. Most of these studies relied on locating the 
homeless in shelters, soup kitchens, abandoned buildings, 
or similar locations. Since the homeless in rural areas are 
less likely to have shelters or soup kitchens available to 
them, they are harder to find and a longer survey period 
is recommended. 

To facilitate comparisons with single-day or single- 
week surveys, homeless persons found in this study were 
asked how long they had been homeless. Using this infor- 
mation we were able to determine the number of persons 
in the sampled counties who were homeless during the first 
week of the survey, the first full week of February 1990. 

In Section 3 we present estimates of the homeless rate 
for both the six-month period and the single week. The 
six-month rate includes anyone who met the definition of 
homelessness at any time during the six-month interview 
period. The one-week rate includes those interviewed 
throughout the six months who reported being homeless 
during the first full week of February. 

To avoid duplication of respondents over the six-month 
period, each subject was assigned a unique identification 
number which included the subject’s birth date, gender, 
and first three letters of the last name. Only a single 
duplicate interview was found in the data base; it was 
removed from the data base. (We do not have information 
on duplicates found in the field.) Because of this control 
for duplicate counting, we feel that any bias in our data 
collection procedures would be in the direction of an 
undercount of the rural homeless. 

During the six-month interviewing period, 1,100 adults 
and 480 accompanying children were identified as homeless 
in the 21 sampled counties. 


2.4 The Survey Questionnaire 


If the responses to the screening questions indicated 
that a person was homeless, that subject was asked to 
respond to a questionnaire designed to obtain information 
about the person and his or her life experiences. Of the 
1,100 adults identified as homeless, 919 completed the full 
interview. Although the focus of this paper is on estimating 
the number of rural homeless, we will describe briefly the 
questionnaire used to collect information to characterize 
the homeless. 

The full questionnaire contained three sections. The 
first included questions on demographics and life experi- 
ences (for example, reasons for being homeless, use of 
mental health and other human services, employment 
history, drug and alcohol usage, family structure, and gen- 
eral well-being). The second section contained ten scales 
(including, for example, depression-anxiety, disorientation- 
memory impairment, and retardation-lack of emotion) 
from the Psychiatric Status Schedule developed by Spitzer 
et al. (1970). The final section was an interview post- 
mortem which was completed by the interviewer and 
included information on where the interview occurred, 
respondent characteristics (for example, gender and 
unusual behaviors), and an assessment of the accuracy of 
the respondent’s answers. The findings from this portion 
of the study are summarized by First et al. (1994). 


3. THE ESTIMATES OF RATE OF 
HOMELESSNESS 


3.1 The Estimator 


The regional estimate of the rate of rural homelessness 
was obtained using the standard estimator for a proportion 
from a stratified cluster sample with unequal cluster 
sizes. In this case, the cluster is the county, the cluster size 
is the population within the county, and the stratum size is 
the population within a region. The estimator is as follows: 


For the i-th region, the estimated rate of homelessness 
is $r_i$, where 

$$r_i = \frac{\text{number of homeless in sampled rural counties in the } i\text{-th region}}{\text{total population in sampled rural counties in the } i\text{-th region}}.$$

Then the estimated homeless rate for the state is 

$$r_{\text{state}} = \frac{\sum_i \left[ r_i \times \text{rural county population in the } i\text{-th region} \right]}{\text{total rural county population in Ohio}},$$


where the summation is over the five geographical regions 
shown in Figure 1. The population totals for the 75 non- 
urban counties were obtained from 1990 Census data. 
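
The estimator above can be sketched in a few lines of Python. This is only an illustration: the function names, region labels, and county figures are hypothetical and are not the survey data.

```python
from typing import Dict, List, Tuple

# Each sampled county is a (homeless_count, population) pair.
County = Tuple[int, int]

def regional_rate(sampled: List[County]) -> float:
    """r_i: homeless in the sampled counties over their total population."""
    homeless = sum(h for h, _ in sampled)
    population = sum(p for _, p in sampled)
    return homeless / population

def state_rate(samples: Dict[str, List[County]],
               region_pop: Dict[str, int]) -> float:
    """Population-weighted average of the regional rates."""
    total_pop = sum(region_pop.values())
    weighted = sum(regional_rate(samples[r]) * region_pop[r] for r in samples)
    return weighted / total_pop

# Hypothetical illustrative numbers, not the survey data.
samples = {
    "A": [(30, 40_000), (10, 20_000)],
    "B": [(45, 30_000), (15, 30_000)],
}
region_pop = {"A": 300_000, "B": 200_000}  # population of ALL rural counties per region

rates = {r: regional_rate(c) for r, c in samples.items()}
r_state = state_rate(samples, region_pop)
print({r: round(v * 10_000, 2) for r, v in rates.items()})  # rates per 10,000
print(round(r_state * 10_000, 2))
```

Note that the regional weights use the population of all rural counties in the region, not only the sampled ones.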

The estimated one-week and six-month rates of homeless- 
ness, given as number of homeless persons per 10,000 popula- 
tion, are shown in Table 1. 

Because the above estimator involves the ratio of two 
random variables, the number of homeless and the popu- 
lation size for sampled clusters, the estimator is biased 
(see, for example, Cochran 1977). The bias decreases as 
sample size (number of counties sampled in this case) 
increases. Since our sample size is small, we recognize that 
our estimates are likely to be biased. On the other hand, 
our sampling fraction is large because the number of rural 
counties is small. Hence, we wish to assess the likely 
amount of bias in our estimates. (Note that the small 
sample sizes could also make the standard errors given in 
Table 1 inaccurate.) 


Table 1 
Estimated Rates of Homelessness per 10,000 in Rural Ohio 

Area         One-Week Rate            Six-Month Rate 
             (February 4 -            (February - July 1990) 
             February 10, 1990) 

State        5.68 (0.99)              14.00 (2.09) 
Northeast    3.44 (0.79)              12.00 (2.19) 
Northwest    [illegible]              [illegible] 
Central      5.85 (1.86)              12.11 (3.05) 
Southeast    6.89 (1.93)              15.90 (5.91) 
Southwest    7.25 (2.44)              16.78 (5.32) 


Note: Standard errors are given in parentheses after each estimate. 


3.2 The Simulation Study 


We conducted a simulation study to help us assess the 
likely amount of bias in our estimates. We first generated 
five ‘‘populations’’ each with counts of the homeless for 
all 75 nonurban counties in Ohio. For all five simulated 
populations, the observed numbers of homeless persons 
for the 21 sampled counties were used as the counts in those 
counties. Counts for the remaining 54 counties were gener- 
ated randomly as described below. Note that the simulated 
counts represent the six-month counts of the homeless. 

The first simulated population was created by generating 
the natural log of the rate of homelessness from a 
single normal distribution. The log of the rate was used 
because the observed rates for the 21 sampled counties 
have a highly skewed histogram but the histogram for the 
log of the rates is approximately mound shaped. The mean 
of the observed log rates is 2.465 with a standard deviation 
of 0.7154. Thus, the generated log rates of homelessness 
were randomly sampled (using the statistical package S) 
from a normal distribution with this mean and standard 
deviation. After the log rates were generated for the 
54 nonsampled counties, they were used along with the 
population counts from the 1990 Census for each county 
to obtain the simulated numbers of homeless persons for 
those counties. 
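
A minimal sketch of this generation step follows. It assumes the log rates are logs of rates per 10,000 (exp(2.465) is roughly 11.8, in line with the six-month rates reported) and that counts were rounded to whole persons; neither detail is stated in the text. Python's `random` module stands in for the S package, and the county populations are hypothetical.

```python
import math
import random

LOG_RATE_MEAN = 2.465   # mean of the observed log rates (from the text)
LOG_RATE_SD = 0.7154    # standard deviation of the observed log rates (from the text)

def simulate_counts(populations, rng):
    """Draw a log rate for each county and convert it to a homeless count."""
    counts = []
    for pop in populations:
        log_rate = rng.gauss(LOG_RATE_MEAN, LOG_RATE_SD)
        rate_per_10k = math.exp(log_rate)  # assumption: rates are per 10,000
        # Assumption: round to a whole number of persons.
        counts.append(round(rate_per_10k * pop / 10_000))
    return counts

rng = random.Random(1990)                     # arbitrary seed
hypothetical_pops = [25_000, 40_000, 60_000]  # not actual Census figures
counts = simulate_counts(hypothetical_pops, rng)
print(counts)
```

Working on the log scale guarantees that every simulated rate is positive.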

The second simulated population was created in a 
manner similar to the first except that separate normal 
distributions were used within each of the five geographic 
regions of Ohio. The means and standard deviations of 
the log rates of homelessness for the sampled counties 
within each region were used as the parameters of the 
normal distributions from which the simulated values were 
generated. Again the simulated log rates were used to 
obtain the numbers of homeless persons for the 54 non- 
sampled rural counties. 

The third simulated population was generated using the 
regression of rate of homelessness per 10,000 on the 
percent elderly in each sampled county. (This choice of 
predictor variable is based on the selection of a regression 
estimator as described in Section 4.) The fitted regression 
model is 


$$\text{rate} = -9.02 + 2.32 \times \%\text{elderly},$$


with $R^2 = 0.197$, $\sqrt{\text{MSE}} = 9.03$, and p-value = 0.044 
for the overall F-test for the regression line. The simulated 
population was created by estimating the rate of homeless- 
ness in each nonsampled county from the percent elderly 
in the county and then adding a random normal error 
term. Because a plot of the residuals from the regression 
line suggested that the variance in the residuals is larger 
for counties with higher percentages of elderly, the random 
error terms were generated from two different normal 
distributions depending on whether the percent elderly in 
the county was more or less than 10%. The standard devia- 
tions used for the two normal distributions were the stan- 
dard deviations in the residuals for the counties with 10% 
or more elderly and with less than 10% elderly. 
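
The regression-based generation can be sketched as below. The intercept and slope are the fitted values quoted in the text; the two residual standard deviations are not reported, so `sd_low` and `sd_high` are hypothetical placeholders. The text does not say how negative simulated rates were handled, and this sketch leaves them unchanged.

```python
import random

INTERCEPT = -9.02  # fitted intercept quoted in the text
SLOPE = 2.32       # fitted slope on percent elderly quoted in the text

def simulate_rate(pct_elderly, sd_low, sd_high, rng):
    """Predicted rate per 10,000 plus a normal error whose spread
    depends on whether the county has at least 10% elderly."""
    sd = sd_high if pct_elderly >= 10.0 else sd_low
    return INTERCEPT + SLOPE * pct_elderly + rng.gauss(0.0, sd)

rng = random.Random(3)                # arbitrary seed
pct_elderly = [8.5, 11.0, 13.2]       # hypothetical county values
# sd_low and sd_high are hypothetical; the paper does not report them.
simulated = [simulate_rate(p, sd_low=5.0, sd_high=11.0, rng=rng) for p in pct_elderly]
print([round(r, 2) for r in simulated])
```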

The fourth simulated population was generated using 
the regression of rate of homelessness per 10,000 on the 
percent elderly in each sampled county and on the indicators 
of the region of the state to which the county belongs. 
Using $I_{NE}$, $I_{NW}$, $I_C$, and $I_{SE}$ to represent indicator 
variables for the northeast, northwest, central, and southeast 
regions respectively, the fitted regression model is 


$$\text{rate} = -10.40 + 3.23 \times \%\text{elderly} + \text{[four regional indicator terms; coefficients illegible]},$$


with $R^2 = 0.407$ (adjusted $R^2 = 0.210$), $\sqrt{\text{MSE}} = 8.73$, 
and p-value = 0.127 for the overall F-test for the regression 
line. The simulated population was created by estimating 
the rate of homelessness in each nonsampled county from 
the regression equation and then adding a random normal 
error term. A residual plot again suggested that the vari- 
ance in the residuals is larger for counties with higher 
percentages of elderly. Thus the random error terms were 
generated from two different normal distributions depend- 
ing on whether the percent elderly in the county was more 
or less than 10%. Again, the standard deviations for the 
two normal distributions were the standard deviations of 
the appropriate subsets of residuals. 

The fifth simulated population was generated to be 
somewhat different from the other populations. It was 
generated using a regression model to predict number of 
homeless directly from the population size within each 
county. The fitted regression model is 


$$\text{homeless} = 13.23 + 0.001154 \times \text{population},$$


with $R^2 = 0.386$, $\sqrt{\text{MSE}} = 54.29$, and p-value = 0.003 
for the overall F-test for the regression line. The simulated 
population was created by estimating the number of home- 
less persons in each nonsampled county from the fitted 
regression equation and then adding a random normal 
error term. Because a plot of the residuals suggested that 
the variance in the residuals is larger for counties with 
larger populations, the random error terms were generated 
from two different normal distributions depending on 
whether the county population was more or less than 
30,000. The standard deviations for the two normal distri- 
butions were the standard deviations of the appropriate 
subsets of residuals. 

After the five populations had been generated, they 
were each used to assess the amount of bias in the estimates 
of the rate of rural homelessness. Since we had created the 
entire ‘‘population’’, we could compute the ‘‘true’’ rate 
of homelessness within the entire state and the five geo- 
graphical regions for each of the five populations. 

In the simulation, samples of 21 rural counties were 
selected using the stratified sampling scheme that was used 
for the actual study. That is, four counties were sampled 
at random without replacement from each of the northeast, 
northwest, central, and southwest regions; five were 
sampled from the southeast region. The estimated rates 
of homelessness were computed for the five regions and 
for the state using the formulas given in Section 3.1. These 
estimates were compared to the population rates of home- 
lessness for the simulated population to determine the 
bias in the estimate. This process of selecting a sample, 
computing estimates, and determining the bias was 
repeated 1 million times with replacement for each simulated 
population. (The number of possible samples is more 
than 7.15 x 10'°.) The same stream of random numbers 
was used to select the samples for each of the five popu- 
lations. The results of the simulation are presented in 
Table 2. 
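
The resampling loop described above can be sketched as follows. The population is a small hypothetical one with two regions rather than the 75 Ohio counties, and the number of repetitions is kept far below one million so the example runs quickly.

```python
import random
from statistics import mean, pstdev

def state_rate(sample, region_pop):
    """Stratified ratio estimate: population-weighted mean of regional rates."""
    total = sum(region_pop.values())
    est = 0.0
    for region, counties in sample.items():
        r = sum(h for h, _ in counties) / sum(p for _, p in counties)
        est += r * region_pop[region]
    return est / total

def simulate_bias(population, n_per_region, reps, rng):
    """population maps region -> list of (homeless, pop) for EVERY county.
    Returns (average bias, standard deviation) of the estimator."""
    region_pop = {r: sum(p for _, p in c) for r, c in population.items()}
    true_rate = (sum(h for c in population.values() for h, _ in c)
                 / sum(region_pop.values()))
    estimates = []
    for _ in range(reps):
        # Counties are drawn without replacement within each region.
        sample = {r: rng.sample(c, n_per_region[r]) for r, c in population.items()}
        estimates.append(state_rate(sample, region_pop))
    return mean(estimates) - true_rate, pstdev(estimates)

# Hypothetical two-region population of (homeless, population) counties.
population = {
    "north": [(12, 20_000), (30, 35_000), (8, 15_000), (50, 60_000)],
    "south": [(40, 25_000), (22, 18_000), (90, 70_000), (15, 12_000), (33, 30_000)],
}
rng = random.Random(0)  # arbitrary seed
bias, sd = simulate_bias(population, {"north": 2, "south": 3}, 5_000, rng)
print(f"bias = {bias * 10_000:.4f}, sd = {sd * 10_000:.4f} per 10,000")
```

Reusing one seeded generator for every simulated population reproduces the "same stream of random numbers" device, which makes differences between populations easier to attribute to the populations themselves.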


Table 2 
Bias in the Estimate of the Homeless Rate per 10,000 
for Five Simulated Populations 
(Based on 1,000,000 simulated samples) 

                               Population 
              1          2          3          4          5 
STATE       0.0406     0.1308     0.2618     0.2433     0.2547 
           (2.056)    (1.759)    (2.144)    (1.807)    (1.605) 
REGION 
NE         -0.0406    -0.0379     0.1538     0.0317     0.0993 
           (3.333)    (2.923)    (3.748)    (4.034)    (1.937) 
NW         -0.0578    -0.2948     0.0529     0.0254     0.3234 
           (3.591)    (3.194)    (3.474)    (3.460)    (4.249) 
C          -0.2442     0.2700     0.3974     0.1362     0.1901 
           (3.122)    (3.762)    (3.426)    (2.260)    (2.869) 
SE         -0.1034    -0.0279    -0.1132    -0.1798     0.0427 
           (6.512)    (4.298)    (6.600)    (3.892)    (3.973) 
SW          0.6184     0.8093     0.9196     1.277      0.6716 
           (4.215)    (4.990)    (4.610)    (5.173)    (4.274) 

Note: The standard deviation of the simulated sampling distri- 
bution of the estimator is given in parentheses below each 
value. 


From Table 2 we see that the size of the bias in the 
overall state estimate of homelessness is about 1/100th of 
the size of the estimate itself. (Recall that the actual esti- 
mated six-month rate of homelessness for the state is about 
14 per 10,000 population. The simulated populations have 
state rates between about 13 and 15 per 10,000.) At the 
regional level, the size of the bias is also about 1/100th of 
the size of the regional estimates even though the regional 
estimates are based on much smaller sample sizes. These 
results suggest that the size of the bias in our actual esti- 
mate is likely to be relatively small. 

As would be expected from the small number of counties 
in the sample, the variance of the sampling distribution 
of the estimator is fairly large. The standard deviation in 
the estimates from the simulation study was about 10 times 
the size of the bias. (The standard deviations of the 
1,000,000 estimates in each of the five simulations are of 
the same order of magnitude as the standard error of the 
actual estimate shown in Table 1.) This result suggests 
that the bias in the actual estimate is likely to be rather 
unimportant when compared to the standard error of the 
estimate. 


Finally, we assessed the shape of the sampling distri- 
bution of our estimator by looking at histograms of the 
1,000,000 estimates from each of our five simulation 
studies. The histograms appeared symmetric, mound 
shaped, and remarkably like histograms of normal data. 
Thus, confidence intervals based on the normal approx- 
imation are likely to be fairly accurate. 
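
As an illustration of such an interval, the sketch below applies the normal approximation to the six-month state estimate and standard error from Table 1. The 95% level (z = 1.96) is an assumption made for the example; the text does not specify a confidence level.

```python
def normal_ci(estimate, std_error, z=1.96):
    """Two-sided normal-approximation interval; z = 1.96 gives roughly 95%."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Six-month state rate and its standard error, per 10,000 (Table 1).
low, high = normal_ci(14.00, 2.09)
print(f"{low:.2f} to {high:.2f} per 10,000")  # prints "9.90 to 18.10 per 10,000"
```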


4. A REGRESSION ESTIMATOR 


There is a great deal of information available, for 
example from the Bureau of the Census, on the economic 
conditions in a county. We hoped to be able to use some 
of this information to improve our estimate for the rate 
of homelessness by using a regression estimator. To this 
end, we searched for a regression model relating either the 
number of homeless persons in a county or the rate of 
homelessness with a variety of predictor variables which 
we thought might be useful in explaining homelessness. 
These possible predictor variables included county popula- 
tion, percentage change in population from 1980 to 1990, 
unemployment rate, percent elderly, public welfare expen- 
ditures, average weekly earnings, percent of rental prop- 
erty, median rent, poverty rate, percent female head of 
household, percentage of land in farming, average value 
of farms, average income per farm, ratio of manufacturing 
to farm jobs, indicator of Beale scores - a classification 
system for degree of ruralness (see Thomas 1977), and 
regional indicators. 

None of these possible predictors individually or in 
combination provided a good predictor of the number of 
homeless persons or rate of homelessness. The best single 
predictor was percent elderly, which was used in the model 
that generated the third simulated population described in 
Section 3.2, but it explained less than 20% of the variability 
in the rate of homelessness. No other variable was useful 
in addition to percent elderly and we could not find another 
reasonable regression model. Thus we used percent elderly 
in a regression estimator for the state rate of rural home- 
lessness. Note that percent elderly is a plausible predictor 
of the rate of homelessness because poor economic con- 
ditions in a rural county appear to result in out-migration 
of the young; the elderly remain behind making up a 
greater proportion of the population. Therefore, it is 
possible that the percentage of elderly in a county is a 
proxy for poor economic conditions and out-migration. 
We cannot, however, rule out the possibility that percent 
elderly appears to be related to rate of homelessness in our 
data due to chance. We also realize that unavoidable errors 
in the county-based data collection procedures, such as 
interviewer effect, amount of services available, and geo- 
graphic factors, may contribute to the lack of association 
between rate of homelessness and theoretically relevant 
variables. 

We used the combined regression estimator (see, for 
example, Cochran 1977) to obtain the state estimate of 
14.85 rural homeless per 10,000 with a standard error of 
1.64. This compares with the original estimate of 14.00 
with a standard error of 2.09 as shown in Table 1. Because 
the regression estimator is also biased with the bias de- 
creasing for larger sample sizes, we again used a simulation 
study to assess the bias in this regression estimator. 
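
For reference, the textbook form of the combined regression estimator of a population mean under stratified sampling is given below. The symbols are the conventional ones and are not taken from this paper, which does not spell out how the estimator was adapted to a rate with percent elderly as the auxiliary variable.

```latex
\bar{y}_{lrc} = \bar{y}_{st} + b\,(\bar{X} - \bar{x}_{st}),
\qquad
\bar{y}_{st} = \sum_h W_h \bar{y}_h,
\qquad
\bar{x}_{st} = \sum_h W_h \bar{x}_h
```

Here $W_h$ is the weight of stratum $h$, $\bar{y}_h$ and $\bar{x}_h$ are the stratum sample means, $\bar{X}$ is the known population mean of the auxiliary variable, and $b$ is a single slope estimated from all strata combined.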

The simulation study for the regression estimator was 
carried out using the third and fourth simulated popula- 
tions described in Section 3.2 because those populations 
were generated using a regression model involving percent 
elderly. The simulation again computed the bias in the 
estimate for 1 million stratified cluster samples chosen with 
replacement from each population. The same stream of 
random numbers was used to generate the samples in both 
cases. A summary of the results of the simulation study 
for both the original estimator and the regression estimator 
is given in Table 3. 


Table 3 
Comparison of Estimators of State Homeless Rate per 10,000 
(Summary for 1,000,000 repetitions from two simulated populations) 

                      Original Estimator      Regression Estimator 
                      Population              Population 
                      3          4            3          4 
Average Bias          0.2618     0.2433       1.7115     0.8360 
Standard Deviation    2.144      1.807        1.820      1.246 
MSE                   4.664      3.325        6.242      2.250 


Note that the average bias is larger for the regression 
estimator than for the standard estimator for a rate from 
a stratified cluster sample. The standard deviation of the 
sampling distribution for the regression estimator, how- 
ever, appears to be slightly smaller than that of the original 
estimator for each of the two simulated populations. The 
mean squared errors for the regression estimator fell above 
and below those of the original estimator. Thus, the choice 
of which estimator to use was unclear from the summary 
information in Table 3. 
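
The MSE row of Table 3 can be checked against the other two rows with the usual decomposition, mean squared error = bias squared + variance. The short check below uses only the values printed in Table 3 for populations 3 and 4; the dictionary layout is just for illustration.

```python
# (estimator, population) -> (average bias, standard deviation, reported MSE)
table3 = {
    ("original", 3): (0.2618, 2.144, 4.664),
    ("original", 4): (0.2433, 1.807, 3.325),
    ("regression", 3): (1.7115, 1.820, 6.242),
    ("regression", 4): (0.8360, 1.246, 2.250),
}

recomputed = {}
for key, (bias, sd, reported) in table3.items():
    recomputed[key] = bias ** 2 + sd ** 2  # MSE = bias^2 + variance
    print(key, round(recomputed[key], 3), reported)
```

The recomputed values agree with the reported ones to within rounding of the tabulated inputs. They also show why the comparison is mixed: for population 3 the larger bias of the regression estimator outweighs its smaller variance, while for population 4 the reverse holds.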

Because the regression estimator does not provide a 
clear improvement over the original estimator, the bias on 
average appears to be larger for the regression estimator, 
and the percent elderly variable may have been selected out 
of the many variables we tried due to chance, we chose to 
use the standard estimator of Section 3 for estimating the 
rate of rural homelessness. 


5. CONCLUSIONS 


The most often quoted national figures on homelessness 
were published by Burt and Cohen (1989) who estimated 
rates of homelessness in urban areas at 37.4 per 10,000 popu- 
lation in cities of more than 100,000 and 9 per 10,000 out- 
side of SMAs. This current study of homeless persons in 
nonurban Ohio gives a six-month rate of about 14 home- 
less per 10,000 population and a one-week rate of 5.68 per 
10,000 population. 

The results of our simulation study suggest that the bias 
in the usual estimate of a rate based on our small cluster 
sample is not likely to be important, particularly in com- 
parison to the size of the standard error of the estimate. 
The bias in the estimates for the five geographic regions 
in Ohio was found to be of a similar, relatively small size. 
The simulation study suggests that statistical biases and 
errors are not likely to discredit the substantive results of 
the survey of rural homeless. 

Our regression analysis of the numbers of homeless 
persons from sampled counties suggests that it is difficult 
to explain the numbers of homeless persons in nonurban 
counties using economic and demographic variables that 
might be thought to be related to homelessness. It may be 
that each county is so different from the others, because 
of its location relative to population centers and related 
economic characteristics, that it is impossible to find a 
suitable stratification of the nonurban counties within 
Ohio. The use of a geographically stratified sample in Ohio 
did not appear to reduce the variance of the estimate and 
no other stratification variable was suggested by our 
regression analysis. This may be the case for other states 
as well, although stratification by some variable may be 
possible over, say, the entire United States. 


In This Issue 


This issue of Survey Methodology opens with a special section on Establishment Survey Methods. 
The four papers in this special section deal with important issues in the context of establishment 
surveys such as questionnaire design, sample design and estimation. These papers were initially 
presented at the International Conference on Establishment Surveys, Buffalo, New York, June 1993. 

The paper by Armstrong and St-Jean presents an application of the general framework of 
regression estimation in two-phase sampling. Using data from a two-phase sample of tax records, 
three particular cases of the generalized regression estimator - two regression estimators and a 
poststratified estimator - are empirically compared to the Horvitz-Thompson estimator. The 
empirical study shows that the poststratified estimator is more efficient than the Horvitz-Thompson 
estimator and as efficient as the two regression estimators. 

Gallego, Delincé and Carfagna describe the Monitoring Agriculture with Remote Sensing (MARS) 
project of the European Community. As the project is not capable of producing good estimates of 
crop areas and yields, they describe a method of sampling farms by points to obtain reliable estimates. 
Results of applying this approach in two regions, Emilia Romagna in Italy and the Czech Republic, 
are described. 

Pollock, Turner and Brown discuss the use of capture-recapture sampling to estimate the popula- 
tion size and population totals when only incomplete list frames exist. A discussion of the properties 
of the resulting model based estimators and an example where the establishments are fishing boats 
are presented. 

In the last paper of this special section, Gower gives an overview of important considerations that 
should be taken into account when developing and designing questionnaires for business surveys. 
Examples of applications of focus groups and cognitive research to test questionnaires for business 
surveys are presented. 

Rancourt, Lee and Sarndal present simple correction factors to reduce the bias of the standard 
estimator of the population mean in the case of ratio imputation for confounded nonresponse. The 
effectiveness of these factors is studied by Monte Carlo simulations. The factors are found to be 
effective especially when the model underlying ratio imputation holds. 

The use of the capture-recapture approach for coverage evaluation of the U.S. census is discussed 
by Ding and Fienberg. They give methods for estimating population total and census undercount 
when the assumption of a perfect match between individuals in the census and in the sample is relaxed. 
They propose models to describe two types of matching errors, mismatches and erroneous non- 
matches. The methods are illustrated using data from the 1986 Los Angeles test census and the 1990 
Decennial Census. 

Kott discusses testing a hypothesis about linear regression coefficients using data from a sample 
survey. He suggests an adjustment of the design-based linearization variance estimator to reduce 
its model bias and a formula to estimate its effective degrees of freedom. Two examples of the method 
are presented. 

Cox develops a framework, called matrix masking, for microdata disclosure limitation methods 
that should improve understanding of these methods and of their effect on data use. Within this 
framework, based on ordinary matrix arithmetic, statistical agencies can develop, evaluate and use 
reliable software for disclosure limitation of microdata. The author presents explicit matrix mask 
formulations for the principal microdata masking methods in current use. 

Falorsi, Falorsi and Russo conduct an empirical comparison of some small area estimation methods 
in the context of the Italian Labour Force Survey using data from the 1981 Italian Census. The 
estimators included in their study are a poststratified direct estimator, a synthetic estimator, an optimal 
linear combination of the two, and a sample size dependent estimator. They conclude that, for their 
application, the sample size dependent estimator offers the best balance of variance and bias. 


The paper by Niyonsenga presents a comparison of two nonparametric methods of estimation 
of response probabilities in sampling theory via a Monte Carlo simulation. It is shown that, in the 
context of simple random sampling without replacement, the nonparametric variant based on the 
ranks of the values of the auxiliary variable performs better, with respect to both bias and mean 
square error, than the method based on the values of the auxiliary variable, for both the expansion 
and regression estimators. 

Schabenberger and Gregoire compare alternative exact and approximate πps strategies in the 
context of sampling in forestry. Two sequential sampling schemes due to Sunter combined with the 
Horvitz-Thompson estimator are compared to the random group strategy of Rao, Hartley and 
Cochran (RHC) as well as a ratio of means estimator used with simple random sampling. If the size 
variable is highly correlated with the variable of interest then πps strategies are considerably more 
efficient. When the correlation is very high the exact πps strategy is most efficient; however, the 
RHC strategy has the advantage of simplicity. If the correlation is low then the πps strategies can 
be very inefficient. 


The Editor 


Generalized Regression Estimation for a Two-Phase 
Sample of Tax Records 


JOHN ARMSTRONG and HELENE ST-JEAN¹ 


ABSTRACT 


A generalized regression estimator for domains and an approximate estimator of its variance are derived under 
two-phase sampling for stratification with Poisson selection at each phase. The derivations represent an application 
of the general framework for regression estimation for two-phase sampling developed by Sarndal and Swensson 
(1987) and Sarndal, Swensson and Wretman (1992). The empirical efficiency of the generalized regression estimator 
is examined using data from Statistics Canada’s annual two-phase sample of tax records. Three particular cases 
of the generalized regression estimator - two regression estimators and a poststratified estimator - are compared 
to the Horvitz-Thompson estimator. 


KEY WORDS: Model assisted estimation; Domain estimation; Poisson sampling. 


1. INTRODUCTION 


In this paper the problem of domain estimation under 
two-phase sampling for stratification is examined in a case 
in which Poisson sampling is used at both phases of selec- 
tion. Consider a population of N units and suppose that 
it is necessary to estimate the total of a characteristic of 
interest, y, for L disjoint domains. Domain membership 
can be well, but not exactly, predicted using an auxiliary 
variable, θ, that is not observed before sampling. The cost 
of obtaining information on θ is lower than the cost of 
obtaining information on y and lower than the cost of 
obtaining exact domain membership data. At the first 
phase of sampling, a Poisson sample is drawn from the 
population and the value of θ is observed for each sampled 
unit. The units in the first-phase sample are stratified using 
θ-values. This stratification is an approximation to stratification 
by domain. At the second phase of sampling, a 
Poisson sample is drawn from each stratum. The value of 
y, as well as exact domain membership data, is observed 
for each unit in the second-phase sample. 

The Horvitz-Thompson estimator of the total of y for 
domain d is $\hat{Y}_{HT}(d) = \sum_{i \in s2} y_i(d)/(p_{1i} p_{2i})$, where 
$y_i(d)$ takes the value of $y_i$ if unit i falls in domain d and 
otherwise takes the value zero, s2 denotes the second-phase 
sample and $p_{1i}$ and $p_{2i}$ are first- and second-phase 
selection probabilities, respectively, for unit i. Since the 
sample sizes obtained using Poisson sampling are random 
variables, this estimator may be inefficient. (See Sunter 
1986 or Sarndal, Swensson and Wretman 1992, p. 63.) 
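
The two-phase design and the Horvitz-Thompson domain estimator can be sketched as below. The sketch is deliberately simplified: selection probabilities are fixed constants rather than depending on strata formed from the auxiliary variable, and the population, field names, and domain labels are hypothetical.

```python
import random

def poisson_sample(units, prob_key, rng):
    """Poisson sampling: keep each unit independently with its own probability."""
    return [u for u in units if rng.random() < u[prob_key]]

def ht_domain_total(second_phase, domain):
    """Sum of y_i(d) / (p1_i * p2_i) over the second-phase sample."""
    return sum(u["y"] / (u["p1"] * u["p2"])
               for u in second_phase if u["domain"] == domain)

rng = random.Random(42)  # arbitrary seed
# Hypothetical population of 200 units in two domains.
population = [
    {"y": rng.uniform(10, 100), "domain": i % 2, "p1": 0.5, "p2": 0.4}
    for i in range(200)
]
true_total = sum(u["y"] for u in population if u["domain"] == 0)

estimates = []
for _ in range(2_000):
    s1 = poisson_sample(population, "p1", rng)  # first phase
    s2 = poisson_sample(s1, "p2", rng)          # second phase, drawn from s1
    estimates.append(ht_domain_total(s2, 0))

avg_estimate = sum(estimates) / len(estimates)
print(round(true_total, 1), round(avg_estimate, 1))
```

The average of the estimates lies close to the true domain total, but individual estimates vary considerably because the realized sample size is itself random. This is the inefficiency noted above.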
Generalized regression estimation is an alternative to the 
Horvitz-Thompson estimator that can be employed when 
auxiliary information is available. A generalized regression 
estimator for two-phase Poisson sampling and an approximate 
estimator of its variance are derived in this paper. 

Section 2 contains the derivation of the generalized regres- 
sion estimator and approximate variance estimator. Section 3 
includes a description of the application that motivated 
the estimation problem - Statistics Canada’s annual two- 
phase sample of tax records. The results of an empirical 
study comparing the Horvitz-Thompson estimator with 
three particular cases of the generalized regression estimator 
- the poststratified estimator currently used in production 
and two regression estimators - are described in Section 4. 


2. GENERALIZED REGRESSION ESTIMATION 


Generalized regression estimation is not a new technique. 
A generalized regression estimator for a one-phase sample 
design is described by Deming and Stephan (1940). Recent 
applications of generalized regression estimation at Statistics 
Canada include the work of Lemaitre and Dufour (1987) 
and Bankier, Rathwell and Majkowski (1992). Hidiroglou, 
Sarndal and Binder (1993) provide an extensive discussion 
of the use of generalized regression estimators for business 
surveys. 

Derivation of generalized regression estimators can be 
approached from the perspective of model assisted survey 
sampling (Sarndal, Swensson and Wretman 1992) or from 
the perspective of calibration (Deville and Sarndal 1992). 
Let U = {u} and V = {v} denote sets of first-phase poststrata 
and second-phase poststrata, respectively. During 
generalized regression weighting of the first-phase sample, 
the design weights $1/p_{1i}$ are adjusted to yield weights 
$\tilde{w}_{1i} = g_{1i}/p_{1i}$ that respect the calibration equations 

¹ John Armstrong, Social and Economic Studies Division, 24 - R.H. Coats Bldg., and Hélène St-Jean, Business Survey Methods Division, 
11 - R.H. Coats Bldg., Statistics Canada, Tunney’s Pasture, Ottawa, Ontario, K1A 0T6. 


$$\sum_{i \in s1 \cap u} \tilde{w}_{1i} x_i = X_u$$


for each first-phase poststratum u, where $x_i$ is an $L_1 \times 1$ 
vector of auxiliary variables known for all units in the 
population and $X_u$ is the vector of auxiliary variable totals 
for poststratum u. The adjusted weights minimize the 
distance measure $\sum_{i \in s1} (g_{1i} - 1)^2/p_{1i}$. The same weights 
can be obtained from a model assisted perspective using 


$$E_\xi(y_i) = x_i' \beta_u, \quad i \in u$$
$$V_\xi(y_i) = \sigma^2,$$


where y; is the value of the variable of interest for unit /, 
and E;(-) and V;(-) denote expectation and variance, 
respectively, with respect to the model. 
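
Minimizing this kind of quadratic distance subject to the calibration equations has a well-known closed-form solution, the linear calibration weight: the g-weight equals 1 plus a linear function of the unit's auxiliary vector. The sketch below computes such weights for a single poststratum with two auxiliary variables and checks that the calibration equations hold. The sample values, probabilities, and known totals are hypothetical.

```python
def g_weights(xs, p1, totals):
    """g_i = 1 + (X - X_hat)' M^{-1} x_i for two auxiliary variables,
    where X_hat is the Horvitz-Thompson estimate and M = sum x_i x_i' / p_i."""
    # Horvitz-Thompson estimates of the two auxiliary totals.
    xhat = [sum(x[k] / p for x, p in zip(xs, p1)) for k in range(2)]
    # Entries of the symmetric 2 x 2 matrix M.
    m00 = sum(x[0] * x[0] / p for x, p in zip(xs, p1))
    m01 = sum(x[0] * x[1] / p for x, p in zip(xs, p1))
    m11 = sum(x[1] * x[1] / p for x, p in zip(xs, p1))
    det = m00 * m11 - m01 * m01
    d0, d1 = totals[0] - xhat[0], totals[1] - xhat[1]
    # lambda' = (X - X_hat)' M^{-1}, using the explicit 2 x 2 inverse.
    lam0 = (d0 * m11 - d1 * m01) / det
    lam1 = (d1 * m00 - d0 * m01) / det
    return [1.0 + lam0 * x[0] + lam1 * x[1] for x in xs]

# Hypothetical sample: x_i = (1, size_i), with first-phase probabilities p1.
xs = [(1.0, 2.0), (1.0, 5.0), (1.0, 3.0), (1.0, 10.0)]
p1 = [0.2, 0.5, 0.25, 0.5]
totals = (12.0, 60.0)  # hypothetical known totals (unit count, total size)

g = g_weights(xs, p1, totals)
calibrated = [sum(gi * x[k] / p for gi, x, p in zip(g, xs, p1)) for k in range(2)]
print([round(gi, 4) for gi in g])
print([round(c, 6) for c in calibrated])  # reproduces the known totals
```

With a single auxiliary variable equal to 1 for every unit, the same formula reduces to the familiar poststratification ratio of the known count to the estimated count.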

For the generalized regression estimators of interest, 
weighting of the second-phase sample involves a calibration 
procedure that is conditional on the results of first-phase 
weighting. The initial weights, $\tilde{w}_{1i}/p_{2i}$, where $\tilde{w}_{1i}$ is the 
adjusted first-phase weight, are adjusted to give final weights, 
$w_i = g_{2i}\tilde{w}_{1i}/p_{2i}$, that satisfy the calibration equations 

$$\sum_{i \in s2 \cap v} w_i z_i = \hat{Z}_v$$

for each second-phase poststratum v, where $z_i$ is an $L_2 \times 1$ 
vector of auxiliary variables known for all units in the 
first-phase sample and $\hat{Z}_v = \sum_{i \in s1 \cap v} \tilde{w}_{1i} z_i$ is an estimate 
of the vector of auxiliary variable totals for poststratum v, 
computed using the adjusted first-phase weights $\tilde{w}_{1i}$. 
Note that these calibration equations differ in an important 
way from the examples given by Särndal and Swensson 
(1987, pp. 284-288) and Särndal, Swensson and Wretman 
(1992, pp. 359-366) because they involve adjusted first-phase 
weights rather than first-phase design weights. 
The final weights minimize the distance measure 
$\sum_{i \in s2} \tilde{w}_{1i}(g_{2i} - 1)^2 / p_{2i}$. The model needed to obtain the same 
weights from a model assisted perspective is 

$$E_\xi(\tilde{w}_{1i} y_i) = \tilde{w}_{1i} z_i' \beta_v, \quad i \in v,$$
$$V_\xi(\tilde{w}_{1i} y_i) = \tilde{w}_{1i} \sigma^2.$$

Use of adjusted first-phase weights rather than first-phase 
design weights in the second-phase calibration equations 
has two important advantages. First, the generalized 
regression estimator for domain d can be written as 

$$\hat{Y}_{\text{GREG}}(d) = \sum_{i \in s2} y_i(d)\, g_{1i} g_{2i} / (p_{1i} p_{2i}),$$

using first-phase and second-phase g-weights. Second, 
suppose that some auxiliary variables are used for calibration 
at both phases of weighting. Estimates of population 
totals for such variables that are equal to actual totals can 
be constructed using final weights. 


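Let $\hat{X}_u = \sum_{i \in s1 \cap u} x_i / p_{1i}$ denote the $L_1 \times 1$ vector of 
Horvitz-Thompson estimates of auxiliary variable totals 
for first-phase poststratum u. The first-phase g-weight is 

$$g_{1i} = 1 + \lambda_u' x_i,$$

where $\lambda_u' = (X_u - \hat{X}_u)' M_u^{-1}$ and $M_u^{-1} = (\sum_{i \in s1 \cap u} x_i x_i' / p_{1i})^{-1}$. 
For second-phase poststratum v, denote the 
estimate of $\hat{Z}_v$ based on initial second-phase weights by 
$\tilde{Z}_v = \sum_{i \in s2 \cap v} \tilde{w}_{1i} z_i / p_{2i}$, where $\tilde{w}_{1i} = g_{1i}/p_{1i}$ is the adjusted 
first-phase weight. The second-phase g-weight is 

$$g_{2i} = 1 + \lambda_v' z_i,$$

where $\lambda_v' = (\hat{Z}_v - \tilde{Z}_v)' M_v^{-1}$ and $M_v^{-1} = (\sum_{i \in s2 \cap v} \tilde{w}_{1i} z_i z_i' / p_{2i})^{-1}$. 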
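
As an illustration of the g-weight formulas, the following minimal sketch computes linear-calibration g-weights for a single poststratum, using $g_i = 1 + \lambda' x_i$ with $\lambda = M^{-1}(X - \hat{X})$ and $M = \sum x_i x_i'/p_i$. The function names `solve` and `linear_g_weights` and the toy data are hypothetical; they are not taken from the paper or from the CALMAR macro.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[k][n] / m[k][k] for k in range(n)]


def linear_g_weights(x, p, totals):
    """g-weights for one poststratum.

    x      : list of auxiliary vectors (one per sampled unit)
    p      : list of selection probabilities
    totals : known (or previously estimated) auxiliary totals
    """
    L = len(totals)
    # Horvitz-Thompson estimates of the auxiliary totals
    x_hat = [sum(xi[k] / pi for xi, pi in zip(x, p)) for k in range(L)]
    # M = sum of x_i x_i' / p_i
    m = [[sum(xi[k] * xi[l] / pi for xi, pi in zip(x, p)) for l in range(L)]
         for k in range(L)]
    lam = solve(m, [t - h for t, h in zip(totals, x_hat)])
    return [1.0 + sum(lk * v for lk, v in zip(lam, xi)) for xi in x]


# Toy poststratum (hypothetical values): a constant and one size variable.
x = [[1.0, 10.0], [1.0, 20.0], [1.0, 35.0], [1.0, 50.0]]
p = [0.2, 0.25, 0.5, 0.5]
X_u = [15.0, 400.0]

g = linear_g_weights(x, p, X_u)
calibrated = [sum(gi * xi[k] / pi for gi, xi, pi in zip(g, x, p)) for k in range(2)]

# With a single auxiliary variable equal to one, every g-weight is N / N_hat.
g_const = linear_g_weights([[1.0]] * 4, p, [15.0])
```

The weighted totals in `calibrated` reproduce `X_u`, which is the calibration constraint. The second call shows the ratio-adjustment special case, where the g-weight is constant within the poststratum.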


The approximate variance of $\hat{Y}_{\text{GREG}}(d)$ is given by 

$$V(\hat{Y}_{\text{GREG}}(d)) = \sum_{i} \frac{(1 - p_{1i})}{p_{1i}} Q_{1i}^2 + E_1\left( \sum_{i \in s1} \frac{(1 - p_{2i})}{p_{2i}} (\tilde{w}_{1i} Q_{2i})^2 \right),$$

where $E_1(\cdot)$ denotes expectation with respect to the first 
phase of sampling, $\tilde{w}_{1i} = g_{1i}/p_{1i}$ is the adjusted first-phase 
weight, $Q_{1i} = y_i(d) - x_i' B_u$ for each unit in 
first-phase poststratum u, and $B_u$, the vector of estimated 
coefficients from the regression of y(d) on x that would 
be obtained if y(d) was available for all units in first-phase 
poststratum u, is given by 

$$B_u = \left( \sum_{i \in u} x_i x_i' \right)^{-1} \left( \sum_{i \in u} x_i y_i(d) \right).$$

Similarly, $Q_{2i} = y_i(d) - z_i' B_v$ for each unit in second-phase 
poststratum v and $B_v$, the vector of estimated coefficients 
from the regression of y(d) on z that would be 
obtained, conditional on the first-phase calibration, if 
y(d) was available for all units in the component of the 
first-phase sample falling in second-phase poststratum v, 
is given by 

$$B_v = \left( \sum_{i \in s1 \cap v} \tilde{w}_{1i} z_i z_i' \right)^{-1} \left( \sum_{i \in s1 \cap v} \tilde{w}_{1i} z_i y_i(d) \right).$$


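An estimator of the approximate variance of $\hat{Y}_{\text{GREG}}(d)$ 
is 

$$\hat{V}(\hat{Y}_{\text{GREG}}(d)) = \sum_{i \in s2} \frac{(1 - p_{1i})}{p_{1i}^2 p_{2i}} (g_{1i} q_{1i})^2 + \sum_{i \in s2} \frac{(1 - p_{2i})}{(p_{1i} p_{2i})^2} (g_{1i} g_{2i} q_{2i})^2.$$

Since y(d) is available only for units in s2, estimates 
of $B_u$ and $B_v$ are 

$$\hat{B}_u = \left( \sum_{i \in s2 \cap u} w_i x_i x_i' \right)^{-1} \left( \sum_{i \in s2 \cap u} w_i x_i y_i(d) \right),$$

$$\hat{B}_v = \left( \sum_{i \in s2 \cap v} w_i z_i z_i' \right)^{-1} \left( \sum_{i \in s2 \cap v} w_i z_i y_i(d) \right).$$

The sample residuals needed to compute the variance estimator 
are $q_{1i} = y_i(d) - x_i' \hat{B}_u$ and $q_{2i} = y_i(d) - z_i' \hat{B}_v$. 
More details of the derivation of the approximate variance 
of $\hat{Y}_{\text{GREG}}(d)$ and the estimator of the approximate variance 
are given in Appendix A. 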
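
The variance estimator can be sketched in a few lines once the selection probabilities, g-weights and sample residuals are available. The sketch below reads the two terms as $\sum_{s2} (1 - p_{1i})(g_{1i} q_{1i})^2 / (p_{1i}^2 p_{2i})$ and $\sum_{s2} (1 - p_{2i})(g_{1i} g_{2i} q_{2i})^2 / (p_{1i} p_{2i})^2$, which is the form obtained in Appendix A; the function name and the example values are hypothetical.

```python
def greg_variance_estimate(p1, p2, g1, g2, q1, q2):
    """Two-term variance estimate over the second-phase sample.

    All arguments are equal-length sequences indexed by sampled unit:
    selection probabilities (p1, p2), g-weights (g1, g2) and
    first- and second-phase sample residuals (q1, q2).
    """
    first_phase = sum(
        (1.0 - a) / (a ** 2 * b) * (ga * ra) ** 2
        for a, b, ga, ra in zip(p1, p2, g1, q1)
    )
    second_phase = sum(
        (1.0 - b) / (a * b) ** 2 * (ga * gb * rb) ** 2
        for a, b, ga, gb, rb in zip(p1, p2, g1, g2, q2)
    )
    return first_phase + second_phase


# One unit with p1 = p2 = 0.5, unit g-weights, residuals 2 and 1.
v_example = greg_variance_estimate([0.5], [0.5], [1.0], [1.0], [2.0], [1.0])

# A unit selected with certainty at both phases contributes nothing.
v_certainty = greg_variance_estimate([1.0], [1.0], [1.0], [1.0], [5.0], [5.0])
```

Units selected with certainty at a given phase drop out of the corresponding term, which is consistent with treating their g-weights as one.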

If y is strongly correlated with x and z, the variance of 
the generalized regression estimator of the population total 
of y will be relatively small. However, it is important to 
note that strong correlations between y and x and z will 
not necessarily lead to a relatively small variance for the 
estimate of the total of y for a particular domain, since 
y(d) may be poorly correlated with x and z within post- 
strata that include at least one sampled unit falling in 
domain d. 

The correlation between y(d) and x and z within a 
poststratum that includes at least one sampled unit falling 
in domain d may be low if some sampled units in the 
poststratum do not fall in domain d. This situation may 
arise often if domain totals of auxiliary variables and/or 
exact domain membership information for units in the 
first-phase sample are unavailable. In the context of two- 
phase sampling for stratification, there is no domain 
membership information available before selection of the 
first-phase sample. If each first-phase poststratum is 
formed by combining one or more first-phase sampling 
strata, for example, most first-phase poststrata will include 
more than one domain. The variable used to predict 
domain membership during stratification of the first-phase 
sample is not an exact predictor. If second-phase post- 
strata are formed by combining second-phase sampling 
strata, each domain may be divided between a number of 
second-phase poststrata. 

Depending on the type of auxiliary information used, 
the g-weights associated with the generalized regression 
estimator and, consequently, generalized regression esti- 
mates, may be negative. 


3. APPLICATION: TWO-PHASE SAMPLING 
OF TAX RECORDS 


The two-phase tax sample is part of a general strategy 
at Statistics Canada for production of annual estimates 
of Canadian economic activity. Annual economic data for 
large businesses are collected through mail-out sample 
surveys. Data for small businesses are obtained from the 
tax sample. Estimates of financial variables for the busi- 
ness population are obtained by combining tax and survey 
estimates. Tax data rather than survey data are used to 
obtain small business estimates in order to reduce costs and 
response burden. 


The two-phase sample design was introduced in response 
to a requirement for estimates for domains defined using 
the four-digit Standard Industrial Classification (SIC) 
code (Statistics Canada 1980). The first two digits of SIC 
(SIC2) provide a classification of business activity into 
76 groups. Within each group, four-digit SIC (SIC4) codes 
provide classification into finer categories. For example, 
the SIC2 code of a business might classify it in the trans- 
portation industry while the SIC4 code describes the 
activity of the business as bulk liquids trucking. 


There are two types of taxfilers - T1s and T2s. A T1 
taxfiler is an individual, who may own all or part of one 
or more unincorporated businesses, while a T2 taxfiler is 
an incorporated business. Administrative files that contain 
limited information for all taxfilers that are associated 
with businesses are provided to Statistics Canada by 
Revenue Canada, the Canadian government department 
responsible for tax collection. These files are used to 
construct a sampling frame. Information concerning 
numbers of businesses owned by T1 taxfilers and owner- 
ship shares is not available on the sampling frame. Frame 
data does include geographical information, as well as 
gross business income and net profit for both T1 and T2 
taxfilers. A few other major financial variables, including 
salary and inventory data, are generally available for T2 
taxfilers. Estimates are required for about 35 financial 
variables that can be obtained from tax returns and asso- 
ciated financial statements but are not available on admin- 
istrative files supplied by Revenue Canada. 


Taxfilers that are associated with businesses are classi- 
fied by Revenue Canada using the SIC system. In most 
cases, descriptions of business activity reported on tax 
returns are sufficient to accurately determine SIC2 codes. 
Revenue Canada assigns additional digits of SIC to most 
taxfilers. However, not all taxfilers are classified to the 
four-digit level and the third and fourth digits of SIC4 
codes assigned by Revenue Canada are relatively inac- 
curate. A two-phase approach to sampling of tax records 
was adopted to facilitate accurate estimation of economic 
production at the SIC4 level. 


Section 3.1 includes a brief description of the two-phase 
sampling design. More information about the two-phase 
design is provided in Armstrong, Block and Srinath 
(1993). Sections 3.2 and 3.3 contain information concer- 
ning estimation for the two-phase design. The Horvitz- 
Thompson estimator is described in Section 3.2 and a 
poststratified estimator is discussed in Section 3.3. 
3.1 Sampling Design 


The administrative information used to construct the 
sampling frame for a particular tax year is accumulated 
by Revenue Canada over a period of two calendar years 
as tax returns are received and processed. The use of 
Poisson sampling offers substantial operational advantages 
because sampling operations can begin before a complete 
sampling frame is available. 

The target (in-scope) population for tax sampling is the 
population of businesses with gross income over $25,000, 
excluding large businesses covered by mail-out sample 
surveys. The first-phase sample is a longitudinal sample 
of taxfilers. Strata are defined by SIC2, province and size 
(gross business income). All taxfilers that are included in 
the first-phase sample for tax year T and are still in-scope 
for tax sampling in tax year T + 1 remain in the first-phase 
sample for tax year T + 1. Taxfilers may be added 
to the first-phase sample each year to improve the precision 
of certain estimates and to replace taxfilers sampled in 
previous years that are no longer in-scope. 

To implement Poisson sampling for first-phase sample 
selection, each taxfiler is assigned a pseudo-random 
number (hash number) in the interval (0,1) generated by 
a hashing function that uses the unique taxfiler identifier 
as input. The hash number for each taxfiler is compared 
to the sampling interval for the corresponding stratum. If 
the hash number for a particular taxfiler falls in the 
corresponding sampling interval and the taxfiler is not 
already in the first-phase sample, then the taxfiler is added 
to the first-phase sample. Since taxfiler identifiers do not 
change over time, Poisson sampling facilitates selection 
of a longitudinal first-phase sample. 
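
The hash-number mechanism can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which hashing function is used, so SHA-256 is substituted here, and the names `hash_number`, `poisson_select`, the `salt` parameter and the identifier format are all hypothetical.

```python
import hashlib


def hash_number(taxfiler_id: str, salt: str = "phase1") -> float:
    """Map a taxfiler identifier to a reproducible number strictly inside (0, 1).

    SHA-256 is an assumption made for this sketch. A different salt gives a
    different, effectively unrelated hash number for the same identifier.
    """
    digest = hashlib.sha256(f"{salt}:{taxfiler_id}".encode("utf-8")).digest()
    k = int.from_bytes(digest[:8], "big") >> 12      # keep 52 bits
    return (k + 0.5) / 2 ** 52


def poisson_select(taxfiler_ids, stratum_of, interval_of, salt="phase1"):
    """Select each taxfiler whose hash number falls in its stratum's sampling interval."""
    selected = []
    for tid in taxfiler_ids:
        low, high = interval_of[stratum_of[tid]]
        if low <= hash_number(tid, salt) < high:
            selected.append(tid)
    return selected


# Hypothetical frame: 1,000 taxfilers in a single stratum sampled at 30%.
ids = [f"T2-{k:06d}" for k in range(1000)]
stratum_of = {tid: "stratum-A" for tid in ids}
sample_year_1 = poisson_select(ids, stratum_of, {"stratum-A": (0.0, 0.3)})
sample_year_2 = poisson_select(ids, stratum_of, {"stratum-A": (0.0, 0.3)})
```

Because the hash number depends only on the identifier, rerunning the selection on a later version of the frame returns the same taxfilers wherever the sampling interval is unchanged, and selection can start before the frame is complete.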

First-phase selection probabilities for taxfilers that are 
already included in the first-phase sample are updated each 
year. Longitudinal updating is necessary because: (i) a tax- 
filer may fall in different first-phase sampling strata in 
consecutive tax years; and (ii) first-phase sampling fractions 
for a given stratum may vary from one year to the next. 

Copies of tax returns and associated financial statements 
for taxfilers in the first-phase sample are sent to Statistics 
Canada from Revenue Canada. In order to select the second- 
phase sample, statistical entities are created using infor- 
mation about businesses corresponding to taxfilers in the 
first-phase sample. Let J = {j} denote the population of 
businesses that is the target population for tax sampling. 
A statistical entity, denoted by (i,j), is created for every 
taxfiler-business combination in the first-phase sample. For 
each T1 taxfiler in the first-phase sample, data for all busi- 
nesses wholly or partially owned by the taxfiler (including 
ownership shares) that are needed to create statistical entities 
are available from tax returns and associated financial 
statements. Since there is a one-to-one correspondence 
between businesses and T2 taxfilers, a single statistical entity 
is created for each T2 taxfiler in the first-phase sample. 


For each tax year, statistical entities that have not 
appeared in previous tax samples are assigned SIC4 codes 
by Statistics Canada. These codes are determined using 
information supplementary to business activity descriptions 
reported on tax returns and are more accurate in digits 
three and four than codes assigned by Revenue Canada. 
For statistical entities that have appeared in previous tax 
samples, the SIC4 assigned earlier is carried forward. 

Conceptually, the second-phase sample is a sample 
of businesses. Operationally, it is a sample of taxfilers 
selected using statistical entities. Statistical entities are 
stratified using SIC4 codes assigned by Statistics Canada, 
as well as province and size. The total revenue of business 
j is used as the size variable for statistical entity (i,j). If 
one statistical entity corresponding to a T1 taxfiler is 
selected for the second-phase sample, then all statistical 
entities corresponding to the taxfiler are selected. Consequently, 
the second-phase selection probability for statistical 
entity (i,j) depends only on i. 

Second-phase sample selection is done by the Poisson 
sampling method using hash numbers generated from 
taxfiler identifiers. The hashing function used for second- 
phase sample selection is independent of the first-phase 
hashing function. 

Data for about 35 financial variables are transcribed 
from tax returns and associated financial statements for 
taxfilers selected in the second-phase sample. SIC4 codes 
assigned by Statistics Canada are updated, if necessary, 
to ensure that all SIC4 codes used during tabulation of 
estimates correspond to the current tax year. 


3.2 Horvitz-Thompson Estimator 


The second-phase sample is a sample of businesses 
selected using statistical entities. Since some businesses are 
partnerships, more than one statistical entity may corres- 
pond to the same business. To construct estimates for the 
population of businesses, an adjustment for the effects of 
partnerships is required. If business j is a partnership, it 
will be included in the second-phase sample if any of the 
corresponding taxfilers are selected. The usual Horvitz-Thompson 
estimator must be adjusted for partnerships to 
avoid over-estimation. Let $\delta_{ij}$ denote the proportion of 
business j owned by taxfiler i and suppose that statistical 
entity (i,j) is selected for the second-phase sample. The 
data for business j is adjusted by multiplying it by $\delta_{ij}$ so 
that only the component of income and expense items 
corresponding to taxfiler i is included in estimates. Rao 
(1968a) describes a similar adjustment in a slightly different 
context. 

Let $y_j$ denote the value of the variable y for business j. 
The Horvitz-Thompson estimate of the total of y over 
domain d, incorporating adjustment for partnerships, is 
given by 

$$\hat{Y}_{\text{H-T}}(d) = \sum_{i \in s2} \sum_{j \in J_i} \delta_{ij} y_j(d) / (p_{1i} p_{2i}),$$

where $J_i$ is a set containing the indices of the businesses 
wholly or partially owned by taxfiler i. Since selection 
probabilities depend only on the taxfiler index i, $\hat{Y}_{\text{H-T}}(d)$ 
can be written as 

$$\hat{Y}_{\text{H-T}}(d) = \sum_{i \in s2} y_i(d) / (p_{1i} p_{2i}),$$

where 

$$y_i(d) = \sum_{j \in J_i} \delta_{ij} y_j(d).$$

$\hat{Y}_{\text{H-T}}(d)$ is an unbiased estimator of the population total 
of y for businesses in domain d. Refer to Rao (1968a). 
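
The partnership adjustment can be illustrated with a short sketch in which each sampled taxfiler carries a list of statistical entities with ownership shares. The data layout, the function name `ht_domain_total` and the domain codes are hypothetical and chosen only for illustration.

```python
def ht_domain_total(s2, p1, p2, domain):
    """Horvitz-Thompson domain total with ownership-share adjustment.

    s2     : {taxfiler: [(business_domain, ownership_share, y_value), ...]}
             for taxfilers in the second-phase sample
    p1, p2 : {taxfiler: selection probability} for each phase
    domain : domain code of interest
    """
    total = 0.0
    for i, entities in s2.items():
        # y_i(d): share-weighted sum over the taxfiler's businesses in domain d
        y_i = sum(share * y for (dom, share, y) in entities if dom == domain)
        total += y_i / (p1[i] * p2[i])
    return total


# Taxfilers A and B each own half of one business (y = 100) in domain "4561";
# B also wholly owns a business (y = 40) in domain "4562".
s2 = {
    "A": [("4561", 0.5, 100.0)],
    "B": [("4561", 0.5, 100.0), ("4562", 1.0, 40.0)],
}
certain = {"A": 1.0, "B": 1.0}

total_4561 = ht_domain_total(s2, certain, certain, "4561")
total_4562 = ht_domain_total(s2, certain, certain, "4562")
```

With both partners sampled with certainty, the partnership contributes 100 to the domain total rather than 200, which is the over-estimation the share adjustment is meant to avoid.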


The second-phase sample is obtained by Poisson subsampling 
of the first-phase Poisson sample. Consequently, 
the second-phase sample is also a Poisson sample and the 
variance of $\hat{Y}_{\text{H-T}}(d)$ is 

$$V(\hat{Y}_{\text{H-T}}(d)) = \sum_{i} \left[ (1 - p_{1i} p_{2i}) / (p_{1i} p_{2i}) \right] y_i(d)^2.$$

An unbiased estimator of this variance is 

$$\hat{V}(\hat{Y}_{\text{H-T}}(d)) = \sum_{i \in s2} \left[ (1 - p_{1i} p_{2i}) / (p_{1i} p_{2i})^2 \right] y_i(d)^2.$$
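
A minimal sketch of the variance estimator for a Poisson sample follows, treating the product of the two selection probabilities as the overall inclusion probability of each taxfiler. The function name and values are hypothetical.

```python
def ht_variance_estimate(y, p1, p2):
    """Estimated variance of the Horvitz-Thompson total under Poisson sampling.

    y      : share-adjusted domain values y_i(d) for units in the second-phase sample
    p1, p2 : first- and second-phase selection probabilities
    """
    total = 0.0
    for value, a, b in zip(y, p1, p2):
        pi = a * b                       # overall inclusion probability
        total += (1.0 - pi) / pi ** 2 * value ** 2
    return total


v_one_unit = ht_variance_estimate([10.0], [0.5], [0.5])
v_take_all = ht_variance_estimate([10.0, 25.0], [1.0, 1.0], [1.0, 1.0])
```

Units selected with certainty at both phases add nothing to the estimate, as expected when there is no sampling variability.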


3.3. Poststratified Horvitz-Thompson Estimator 


Adjustment of the Horvitz-Thompson estimator to 
account for differences between actual and expected 
sample sizes under Poisson sampling was suggested by 
Brewer, Early and Joyce (1972). The methodology currently 
used to produce estimates based on the two-phase tax 
sample incorporates such adjustments. 

Ratio adjustments are applied within poststrata during 
weighting of both the first- and second-phase samples. 
Choudhry, Lavallée and Hidiroglou (1989) provide a 
general discussion of weighting for a two-phase Poisson 
sample using poststratified ratio adjustments. Suppose 
that first-phase poststratum u contains $N_u$ taxfilers. An 
estimate of the number of taxfilers in the population that 
fall in first-phase poststratum u, based on the first-phase 
sample, is 

$$\hat{N}_u = \sum_{i \in s1 \cap u} (1/p_{1i}).$$

The poststratified first-phase weight for taxfiler i, i ∈ u, is 

$$W_{1i} = (1/p_{1i}) (N_u / \hat{N}_u).$$
An estimate of the number of taxfilers in second-phase 
poststratum v, based on the first-phase sample, is 

$$\hat{N}_v = \sum_{i \in s1 \cap v} W_{1i}.$$

An alternative estimate, using only units in the second-phase 
sample, is 

$$\tilde{N}_v = \sum_{i \in s2 \cap v} W_{1i} / p_{2i}.$$

The poststratified second-phase weight for statistical entity 
(i,j) in poststratum v is 

$$W_{2i} = (1/p_{2i}) (\hat{N}_v / \tilde{N}_v)$$

and the final weight is 

$$W_i = W_{1i} W_{2i}.$$

The poststratified estimate of the total of y over domain 
d is 

$$\hat{Y}(d) = \sum_{i \in s2} W_i y_i(d).$$
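
The two ratio adjustments can be sketched as follows. The record layout (dictionary keys `u`, `v`, `p1`, `p2`, `in_s2`), the function name and the example values are hypothetical; the sketch only mirrors the sequence of steps described above.

```python
from collections import defaultdict


def poststratified_weights(units, N_u):
    """Final weights for a two-phase Poisson sample with ratio adjustments.

    units : first-phase sample records, each a dict with first- and second-phase
            poststrata ('u', 'v'), selection probabilities ('p1', 'p2') and a
            flag 'in_s2' marking second-phase sample membership
    N_u   : {first-phase poststratum: known number of taxfilers}
    Returns ({unit index: final weight}, {v: first-phase-based count estimate}).
    """
    n_hat_u = defaultdict(float)
    for r in units:
        n_hat_u[r["u"]] += 1.0 / r["p1"]
    # Poststratified first-phase weights
    w1 = [(1.0 / r["p1"]) * N_u[r["u"]] / n_hat_u[r["u"]] for r in units]

    n_hat_v = defaultdict(float)   # estimate based on the first-phase sample
    n_til_v = defaultdict(float)   # estimate based on the second-phase sample only
    for r, w in zip(units, w1):
        n_hat_v[r["v"]] += w
        if r["in_s2"]:
            n_til_v[r["v"]] += w / r["p2"]

    final = {}
    for k, (r, w) in enumerate(zip(units, w1)):
        if r["in_s2"]:
            w2 = (1.0 / r["p2"]) * n_hat_v[r["v"]] / n_til_v[r["v"]]
            final[k] = w * w2
    return final, dict(n_hat_v)


units = [
    {"u": "A", "v": "X", "p1": 0.5, "p2": 0.5, "in_s2": True},
    {"u": "A", "v": "X", "p1": 0.5, "p2": 0.5, "in_s2": False},
    {"u": "A", "v": "Y", "p1": 0.25, "p2": 1.0, "in_s2": True},
    {"u": "B", "v": "Y", "p1": 1.0, "p2": 0.5, "in_s2": True},
]
weights, n_hat_v = poststratified_weights(units, {"A": 10.0, "B": 1.0})
```

Within each second-phase poststratum the final weights of the second-phase units sum to the first-phase-based count estimate, and those estimates in turn sum to the known first-phase poststratum counts.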


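Choudhry, Lavallée and Hidiroglou (1989) note that the 
variance of $\hat{Y}(d)$ is approximately given by 

$$V(\hat{Y}(d)) = \sum_{u} \sum_{i \in u} \frac{(1 - p_{1i})}{p_{1i}} \left( y_i(d) - \frac{Y_u(d)}{N_u} \right)^2 + \sum_{v} \sum_{i \in v} \frac{(1 - p_{2i})}{p_{1i} p_{2i}} \left( y_i(d) - \frac{Y_v(d)}{N_v} \right)^2,$$

where $Y_u(d)$ and $Y_v(d)$ are population totals for the 
variable y over the portions of the domain d belonging to 
poststrata u and v respectively. 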


This variance is estimated by 


ree G) 
“a He) 


Lex('@) 


(= py) One 
pind te eke as aol — , 
y Gin = =) 


2 
iesamury P1iP2i) 


where the estimates N,, and N, are calculated using final 
weights. 
The inclusion of the factor $(N_u/\hat{N}_u)^2 (\hat{N}_v/\tilde{N}_v)^2$ can be 
motivated by an improvement in the conditional properties 
of the estimator (Royall and Eberhardt 1975). A variance 
estimator for the ratio estimator for a one-phase sample 
design including an analogous adjustment factor has also 
been studied by Wu (1982). Empirical work reported by 
Wu and Deng (1983) indicates that the coverage properties 
of confidence intervals based on the normal approximation 
are improved using the adjustment factor. 

$\hat{Y}(d)$ is a particular case of $\hat{Y}_{\text{GREG}}(d)$ that can be 
obtained if a single auxiliary variable with value one for 
all taxfilers is employed during both first- and second-phase 
weighting. In this case, we have $g_{1i} = N_u/\hat{N}_u$ for 
all taxfilers in first-phase poststratum u and $g_{2i} = \hat{N}_v/\tilde{N}_v$ 
for all taxfilers in second-phase poststratum v, where $\hat{N}_v$ 
and $\tilde{N}_v$ are the estimates of the number of taxfilers in 
poststratum v based on the first-phase sample and on the 
second-phase sample, respectively. Note that 
negative g-weights are precluded by this choice of auxiliary 
variables. The variance estimator $\hat{V}(\hat{Y}(d))$ differs in a 
minor way from the estimator $\hat{V}(\hat{Y}_{\text{GREG}}(d))$ for this 
particular case of $\hat{Y}_{\text{GREG}}(d)$. The second-phase g-weight 
appears in the leading term of $\hat{V}(\hat{Y}(d))$ but does not 
appear in the leading term of $\hat{V}(\hat{Y}_{\text{GREG}}(d))$. 


4. EMPIRICAL STUDY 


In order to compare the performance of $\hat{Y}_{\text{H-T}}(d)$, 
$\hat{Y}(d)$ and $\hat{Y}_{\text{GREG}}(d)$, an empirical study was conducted 
using data from the province of Quebec for tax year 1989. 
Since the estimator $\hat{Y}(d)$ is a special case of $\hat{Y}_{\text{GREG}}(d)$, 
it will be called $\hat{Y}_{\text{GREG-TPH}}(d)$ in subsequent discussion. 
(TPH is an abbreviation for two-phase Hájek.) Two other 
generalized regression estimators were considered. In both 
cases, x and z contain a variable with value one for all taxfilers. 
One generalized regression estimator involves calibration 
on taxfiler revenue during second-phase weighting. 
(Taxfiler revenue is included as a second auxiliary variable 
in z.) The second estimator involves calibration on taxfiler 
revenue at both phases of weighting. (Taxfiler revenue is 
included as a second auxiliary variable in both x and z.) 
Estimates of domain totals computed using these two estimators 
are denoted by $\hat{Y}_{\text{GREG-R2}}(d)$ and $\hat{Y}_{\text{GREG-R1R2}}(d)$, 
respectively, in subsequent discussion. 

Estimates were produced for two variables of interest - 
transcribed revenue and total expenses. There are some 
conceptual differences between transcribed revenue and 
taxfiler revenue. For example, capital gains and extraor- 
dinary items are included in taxfiler revenue in many 
industries while they are excluded from transcribed rev- 
enue. In addition, taxfiler revenue contains more data 
capture errors than transcribed revenue since it is not 
subject to the same level of quality control. 

The population used for the study included about 
140,000 T2 taxfilers who reported over $25,000 in revenue 
for tax year 1989. The first- and second-phase selection 
probabilities used during sampling for production for tax 


year 1989 were employed. The first-phase sample included 
approximately 31,000 taxfilers and there were about 
23,000 businesses in the second-phase sample. The correla- 
tion between taxfiler revenue and transcribed revenue for 
businesses in the second-phase sample was 0.969, while the 
correlation between taxfiler revenue and total expenses 
was 0.960. 

Large proportions of units in the first- and second- 
phase samples were selected with certainty. All units with 
first-phase selection probability one were excluded from 
first-phase weighting and the corresponding g-weights 
were set to one. Units with second-phase selection pro- 
bability one were treated analogously during second-phase 
weighting. There were 9,884 units in the first-phase sample 
with first-phase selection probabilities different from one 
and 910 units in the second-phase sample with second- 
phase selection probabilities different from one. Each 
first-phase poststratum consisted of one or more of the 
first-phase sampling strata used during sampling for 1989 
production. These strata were defined using five revenue 
classes. All the sampling strata included in any particular 
first-phase poststratum corresponded to the same revenue 
class. Each first-phase poststratum contained a minimum 
of twenty sampled units. The use of a minimum sample size 
was motivated by concerns about the bias in $\hat{V}(\hat{Y}_{\text{GREG}}(d))$ 
when the number of sampled units used for estimation of 
regression coefficients is very small (Rao 1968b). If a first- 
phase sampling stratum included fewer than twenty 
sampled units, it was combined with sampling strata for 
similar SIC2 codes and the same revenue class until a 
poststratum containing at least twenty sampled units was 
obtained. Application of this procedure led to 166 first- 
phase poststrata. Second-phase poststrata were formed 
analogously, combining sampling strata for similar SIC4 
codes to obtain a minimum sample size of twenty for each 
poststratum. There were 30 second-phase poststrata. 
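
The collapsing rule can be sketched as a simple greedy pass. This is an illustration under assumptions, not the procedure actually used: "similar" codes are taken to mean neighbours in a sorted list within one revenue class, any group still below the minimum at the end is folded into the preceding group, and the function name and stratum labels are hypothetical.

```python
def collapse_strata(strata, min_size=20):
    """Greedily combine neighbouring sampling strata into poststrata.

    strata   : list of (label, number_of_sampled_units), ordered so that
               neighbouring entries have similar industry codes
    min_size : minimum number of sampled units per poststratum
    Returns a list of (list_of_labels, total_sampled_units).
    """
    groups, current, count = [], [], 0
    for label, n in strata:
        current.append(label)
        count += n
        if count >= min_size:
            groups.append((current, count))
            current, count = [], 0
    if current:                      # leftover group is too small
        if groups:
            labels, c = groups.pop()
            groups.append((labels + current, c + count))
        else:
            groups.append((current, count))
    return groups


# Hypothetical sampling strata within a single revenue class.
strata = [("SIC 45", 25), ("SIC 46", 5), ("SIC 47", 8), ("SIC 48", 12), ("SIC 49", 3)]
poststrata = collapse_strata(strata)
```

In this toy case the first stratum stands alone and the remaining four are combined into a single poststratum of 28 sampled units.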

First- and second-phase weights for $\hat{Y}_{\text{GREG-TPH}}(d)$, 
$\hat{Y}_{\text{GREG-R2}}(d)$ and $\hat{Y}_{\text{GREG-R1R2}}(d)$ were calculated using a 
modified version of the SAS macro CALMAR (Sautory 
1991). The set of first-phase sampling weights calculated 
for the GREG-R1R2 estimator included twelve negative 
weights. There were no negative second-phase weights 
calculated for either GREG-R2 or GREG-R1R2. (Negative 
weights are not possible for the GREG-TPH estimator.) 
Estimates of transcribed revenue and total expenses were 
produced for 77 SIC2 domains, 256 SIC3 domains and 
587 SIC4 domains using the three GREG estimators, as 
well as $\hat{Y}_{\text{H-T}}(d)$. Since GREG-R1R2 did not produce any 
negative estimates, no measures were taken to modify the 
negative weights associated with the estimator. 

Results of comparisons of the GREG-TPH and H-T 
estimators are presented in Table 1 and Table 2. The mean 
gains and mean losses reported in the tables are averages 
of ratios of coefficients of variation. The GREG-TPH 
estimator performs better than the H-T estimator for the 
majority of domains. The gains obtained using GREG-TPH 
are particularly large for SIC2 domains. At the SIC4 
level, the estimated coefficient of variation (CV) for the 
GREG-TPH estimate of total expenses is lower than the 
estimated CV for the H-T estimate for 60.5% of domains. 
In cases in which the estimated CV for GREG-TPH is 
lower it is 5.5% smaller, on average, than the estimated 
CV for H-T. When the estimated CV for GREG-TPH is 
higher it is 7.9% larger than the estimated CV for H-T, on 
average. In addition to the information in Tables 1 and 2, 
there is another reason to prefer GREG-TPH to H-T. 
Each year, tax return information for some sampled 
taxfilers is not received by Statistics Canada or is unusable 
because it does not include the necessary financial statements. 
Assuming that such cases of nonresponse are 
ignorable, the GREG-TPH estimator provides an automatic 
nonresponse adjustment. 

Table 1 
Comparison of GREG-TPH and H-T Estimators 
for Transcribed Revenue, Estimated Coefficients of Variation 

                    Gains Using GREG-TPH     Losses Using GREG-TPH 
Type of Domain      Number     Mean          Number     Mean 
SIC2                57         0.768         20         [illegible] 
SIC3                175        0.909         81         1.082 
SIC4                359        0.945         228        1.079 

Table 2 
Comparison of GREG-TPH and H-T Estimators 
for Total Expenses, Estimated Coefficients of Variation 

                    Gains Using GREG-TPH     Losses Using GREG-TPH 
Type of Domain      Number     Mean          Number     Mean 
SIC2                57         [illegible]   20         1.100 
SIC3                175        0.910         81         1.082 
SIC4                355        0.945         232        1.079 

The results in Tables 1 and 2 indicate that the relative 
performance of the GREG-TPH and H-T estimators is 
very similar for both variables of interest. The results of 
the other comparisons of estimators done as part of this 
empirical study did not depend on the variable of interest 
in any important way. Consequently, only results for total 
expenses are reported in subsequent tables. 
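
The gain and loss summaries in the tables can be reproduced from per-domain estimated CVs with a short sketch. It assumes that each ratio is the estimated CV of the candidate estimator divided by that of the reference estimator, which is consistent with a mean gain of 0.945 being described as a 5.5% reduction; the function name and the CV values are hypothetical.

```python
def summarize_cv_ratios(cv_new, cv_ref):
    """Count and average the CV ratios below one (gains) and above one (losses).

    cv_new : estimated CVs for the candidate estimator, one per domain
    cv_ref : estimated CVs for the reference estimator, same domain order
    """
    ratios = [a / b for a, b in zip(cv_new, cv_ref)]
    gains = [r for r in ratios if r < 1.0]
    losses = [r for r in ratios if r > 1.0]

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        "gains": (len(gains), mean(gains)),
        "no_difference": len(ratios) - len(gains) - len(losses),
        "losses": (len(losses), mean(losses)),
    }


# Four hypothetical domains.
summary = summarize_cv_ratios(
    cv_new=[0.045, 0.110, 0.080, 0.060],
    cv_ref=[0.050, 0.100, 0.080, 0.075],
)

# Share of SIC4 domains with a gain for total expenses: 355 of 587.
share_sic4 = 355 / 587
```

The last line checks the arithmetic behind the 60.5% figure quoted for total expenses at the SIC4 level.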

The GREG-TPH estimator is compared to GREG-R2 
and GREG-R1R2 in Tables 3 and 4. Based on estimated 
coefficients of variation, GREG-R2 performs slightly 
better than GREG-TPH. Since a large proportion of 
units in the second-phase tax sample have second-phase 
selection probability one and both GREG-R2 and 
GREG-TPH use the same auxiliary variables during first-phase 
weighting, the marginal differences between GREG-R2 
and GREG-TPH are not surprising. Estimated CVs for 
GREG-R1R2 are generally smaller than estimated CVs for 
GREG-TPH and the relative performance of GREG-R1R2 
improves as domain size increases. Nevertheless, 
GREG-R1R2 is superior to GREG-TPH for only 64% of 
SIC4 domains, and the average increase in estimated CVs 
for those domains in which GREG-R1R2 did worse than 
GREG-TPH is larger than the average decrease in estimated 
CVs for domains in which GREG-R1R2 performed better. 

Table 3 
Comparison of GREG-R2 and GREG-TPH Estimators for 
Total Expenses, Estimated Coefficients of Variation 

                    Gains Using GREG-R2     No Difference     Losses Using GREG-R2 
Type of Domain      Number     Mean         Number            Number     Mean 
SIC2                38         0.993        26                13         1.001 
SIC3                58         0.991        158               40         1.002 
SIC4                88         0.988        439               60         1.009 

Table 4 
Comparison of GREG-R1R2 and GREG-TPH Estimators for 
Total Expenses, Estimated Coefficients of Variation 

                    Gains Using GREG-R1R2     Losses Using GREG-R1R2 
Type of Domain      Number     Mean           Number     Mean 
SIC2                51         0.867          26         1.170 
SIC3                160        0.934          96         1.093 
SIC4                377        0.954          210        1.074 


The results in Tables 3 and 4 indicate that, although the 
GREG-R1R2 estimator shows some promise, it would be 
inappropriate to completely replace the GREG-TPH estimator 
currently used in production by GREG-R1R2. The 
improvements obtained using GREG-R1R2 are relatively 
marginal, given the strong correlation between taxfiler 
revenue and total expenses. Larger improvements could 
be obtained if: (i) SIC codes used for first- and second-phase 
stratification were always consistent with SIC codes 
used to determine the domain membership of sampled 
units; and (ii) formation of first- and second-phase 
poststrata did not require combination of sampling strata 
to obtain a minimum sample size in each poststratum. 

The results reported in Table 5 were obtained after SIC 
codes assigned to taxfilers by Revenue Canada and SIC 
codes used for stratification of the second-phase sample 
were changed for sampled units, where necessary, to 
eliminate inconsistencies between these codes and those 
used to determine domain membership. A comparison of 
Tables 4 and 5 indicates that the relative performance of 
GREG-R1R2 is considerably better when there are no 
classification errors. GREG-R1R2 reduces estimated CVs 
by over 22% (on average) for over 85% of SIC2 domains. 

Table 5 
Comparison of GREG-R1R2 and GREG-TPH Estimators for 
Total Expenses, Estimated Coefficients of Variation, 
No Misclassification 

                    Gains Using GREG-R1R2     Losses Using GREG-R1R2 
Type of Domain      Number     Mean           Number     Mean 
SIC2                66         0.778          11         1.057 
SIC3                184        0.916          72         1.047 
SIC4                402        0.944          185        1.034 

Throughout the empirical results reported here, perfor- 
mance improvements obtained through the use of addi- 
tional auxiliary information increase as domain size 
increases. This result is consistent with the observations 
in Section 2 concerning the conditions under which corre- 
lations between y(d) and the vectors of auxiliary variables, 
x and z, will be high. Provided that the variable of interest 
and the auxiliary variables are highly correlated, correla- 
tions involving y(d) will be strong if each poststratum 
containing at least one sampled unit falling in domain d 
also contains relatively few sampled units that do not fall 
in domain d. 


5. CONCLUSIONS 


Generalized regression estimation provides a conve- 
nient framework for the use of auxiliary information. A 
generalized regression estimator for a two-phase sample 
design with Poisson sampling at both phases of selection 
is derived in this paper. The efficiency of the estimator is 
investigated through application to the two-phase tax 
sample selected by Statistics Canada to obtain annual 
estimates of the economic activity of small businesses. The 
estimation method currently used in production for this 
survey incorporates poststratified ratio adjustments during 
both first- and second-phase weighting to compensate for 
differences between actual and expected sample sizes. This 
poststratified estimator is a particular case of the gener- 
alized regression estimator. 

In an empirical study, the generalized regression esti- 
mator currently used in production (GREG-TPH) performs 
much better than the Horvitz-Thompson estimator. Two 
other generalized regression estimators are also compared 
to GREG-TPH. The alternative estimators produce improv- 
ements for large domains. However, their performance for 
the smaller domains that are of particular interest to users 


of estimates based on the two-phase tax sample does not 
justify complete replacement of the current production 
methodology. 
APPENDIX A: 
DERIVATION OF VARIANCE 
OF $\hat{Y}_{\text{GREG}}(d)$ AND VARIANCE ESTIMATOR 


The variance of $\hat{Y}_{\text{GREG}}(d)$ can be derived using the 
identity 

$$V(\hat{Y}_{\text{GREG}}(d)) = E_1 V_2(\hat{Y}_{\text{GREG}}(d)) + V_1 E_2(\hat{Y}_{\text{GREG}}(d)).$$


First, consider the variance of the estimator with respect 
to the second phase of sampling, conditional on the results 
of first-phase calibration. If the vector of auxiliary vari- 
ables for second-phase weighting, z, includes a variable 
with value one for all taxfilers (or a linear combination of 
auxiliary variables that is equal to one for all taxfilers can 
be constructed), the generalized regression estimator can 
be written as 


Yorec(d) = ys W 1; W) Vi(a) 


i€s2 


= Suprise Wares Zoe 


vy 1és2Mv y 


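Ignoring the variability due to the estimation of regression 
coefficients during second-phase weighting, we have 

$$E_1 V_2(\hat{Y}_{\text{GREG}}(d)) \approx E_1 V_2\left( \sum_{i \in s2} \tilde{w}_{1i} Q_{2i} / p_{2i} \right) = E_1\left( \sum_{i \in s1} \frac{(1 - p_{2i})}{p_{2i}} (\tilde{w}_{1i} Q_{2i})^2 \right),$$

where $\tilde{w}_{1i} = g_{1i}/p_{1i}$ is the adjusted first-phase weight. 
The estimator of $E_1 V_2(\hat{Y}_{\text{GREG}}(d))$ based on the variance 
estimator for calibration estimators advocated by 
Deville and Särndal (1992, p. 380) is 

$$\sum_{i \in s2} \frac{(1 - p_{2i})}{(p_{1i} p_{2i})^2} (g_{1i} g_{2i} q_{2i})^2.$$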
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Ignoring variability due to the estimation of regression 
coefficients during first-phase weighting, the second term 
in the variance expression can be written as 


$$V_1 E_2(\hat{Y}_{GREG}(d)) = V_1\Big(\sum_{i \in s_1} W_{1i}\,\varepsilon_{1i}\Big) = \sum_{i} \frac{(1 - p_{1i})}{p_{1i}}\,\varepsilon_{1i}^2.$$


An estimator of this term is 


$$\widehat{V_1 E_2}(\hat{Y}_{GREG}(d)) = \sum_{i \in s_2} \frac{(1 - p_{1i})}{p_{1i}^2\,p_{2i}}\,(g_{1i}\,e_{1i})^2.$$


Two-Stage Area Frame Sampling on Square Segments 
for Farm Surveys 


F.J. GALLEGO, J. DELINCÉ and E. CARFAGNA¹ 


ABSTRACT 


In the MARS Project (Monitoring Agriculture with Remote Sensing) of the E.C. (European Community), area frames 
based on a square grid are used for area estimation through ground surveys and high resolution satellite images. 
These satellite images are useful, though expensive, for area estimation: their use for yield estimation is not yet 
operational. To fill this gap the sample elements (segments) of the area survey are used as well for sampling farms 
with a template of points overlaid on the segment. Most often we use a fixed number of points per segment. Farmers 
are asked to provide global data for the farm, and estimates are computed with a Horvitz-Thompson approach. 
Major problems include locating farmers and checking for misunderstanding of instructions. Good results are 
obtained for area and for production of the main crops. Area frames need to be complemented with list frames 
(multiple frames) to give reliable estimates for livestock. 


KEY WORDS: Area frame; Point sampling; Segment sampling; Farm sampling. 


1. INTRODUCTION 


The main purpose of this paper is to present the method 
used to sample farms in an area frame by the MARS 
(Monitoring Agriculture with Remote Sensing) Project of 
the European Community (EC). Sampling farms is not a 
central activity in this project, but rather a way of bypassing 
the limitations of the current capabilities of satellite images, 
especially for yield estimation. We shall present a brief 
overview of the MARS Project to make up for the few 
existing references in statistical journals (Ambrosio 1993, 
Gallego 1992). Other presentations can be found in confer- 
ence papers (Meyer Roux 1990, Delincé 1990, Sharman 
et al. 1992, Carfagna et al. 1994) or remote sensing journals 
(Gonzalez et al. 1991, Gallego et al. 1993). 


2. THE MARS PROJECT OF THE EUROPEAN 
COMMUNITY 


The MARS Project was launched in 1988 to assess and 
to develop operational applications of Remote Sensing to 
Agricultural Statistics. It is carried out by the Institute of 
Remote Sensing Applications (IRSA) of the Joint Research 
Centre (JRC) of the EC. Most of the activities of the period 
1988-1993 were divided into 4 main parts, named ‘‘actions’’: 


(1) Regional Crop Inventories. 

(2) Monitoring Vegetation. 

(3) Agrometeorological Models. 

(4) Rapid Estimates at the EC level. 


Some work is also done in other related fields, such 
as area frame sampling. We shall focus here on a sampling 
method used within action 1, ‘‘Regional Inventories’’, 
but we shall first say a word about the other actions. 


2.1 Monitoring Vegetation 


This action deals with low resolution satellite images 
from NOAA-AVHRR (Advanced Very High Resolution 
Radiometer). In these images each pixel covers about 1 km² 
at nadir (directly below the satellite). The main objectives 
are the development of user-friendly software for the pre- 
treatment of these images, and building a data bank with 
time series of vegetation indexes and other indicators for 
about 3,000 monitoring units in the EC. These monitoring 
units have not yet been definitively defined. They should be 
geographic areas roughly between 500 km² and 1,000 km² 
with a more or less homogeneous vegetation or greenness 
index (Houston 1984, Goward 1991). 


2.2 Agrometeorological Models 


General and crop specific models are being currently 
developed on the basis of data from a network of about 
650 Meteorological Observatories in Europe and surroun- 
ding areas. This model CGMS (Crop Growth Monitoring 
System), developed in collaboration with the WOFOST 
(World Food Studies Centre, in Wageningen, Netherlands), 
also uses other data, such as soil and elevation data, 
together with information on the physiology of plants 
(van Diepen 1989, van Lanen 1992). Remote sensing (low 
resolution images) will come into the picture later for the 
spatial interpolation of ground observed meteorological 
data. Parameters of the model are currently computed for 
each cell of a 50 km × 50 km grid. 


¹ F.J. Gallego, Joint Research Centre of the European Communities, tp. 440, 21020 Ispra, Varese, Italy; J. Delincé, Commission of the E.C. DG VI, 
Loi 120, 4-23, 1049 Brussels, Belgium; E. Carfagna, Department of Statistics, University of Bologna, V. Belle Arti 41, 40126 Bologna, Italy. 
2.3 Rapid Estimates at the E.C. Level 


The main goal is giving rapid estimates of area and yield 
change of annual crops compared with the previous year 
based on a two-stage sampling scheme: 53 sites (Figure 1) 
of 40 km × 40 km with a sample of 16 square segments 
of 700 m × 700 m (Figure 2) in each of the sites. Individual 
data are acquired by photo-interpretation of SPOT-XS or 
Landsat-TM images. An average of three images is ana- 
lysed for each site with a minimum of ground information, 
namely a general knowledge of the dominant crops in each 
area. A ground survey is made for an a posteriori valida- 
tion of the photo-interpretation. A monthly report (from 
March to November) is produced with an update of the 
estimates. Each report should use all the images acquired 
more than 15 days before. 




Figure 1. Sample of 53 sites for rapid crop estimates in the E.C. 


Figure 2. Segments in one site (rapid estimates in the E.C.) 


2.4 Regional Crop Inventories by Segment Sampling 
and Remote Sensing 


The objective of the action was to implement, to adapt 
and to assess estimation methods for crop area and pro- 
duction based on area frame sampling and satellite images. 
When this action was implemented by the IRSA in 1988 
on five pilot regions of approximately 20,000 km² each, 
absolute priority was given to annual crops: soft and 
durum wheat, barley, rapeseed, dried pulses, sunflower, 
maize, cotton, tobacco, sugar beet, potatoes, rice and 
soya, as well as fallow. Increasing attention is now being 
given to permanent crops, pastures, and non-agricultural 
land uses. 

Since 1990 the IRSA has progressively transferred the 
initiative to regional or national administrations that wish 
to use area frame surveys based on segments. In general, 
the activities have been shifted to the southern countries 
of the EC and the former communist countries in central 
Europe, that have shown much interest in the method 
(Figure 3). In some cases, like in Italy, there is just an 
exchange of points of view between the national project 
and the IRSA. 


Legend: degree of JRC involvement (high; technical support; collaboration). 

Figure 3. European regions with segment surveys in 1992. 


2.4.1 Sampling Segments on a Square Grid 


There are two main approaches to building an area 
frame based on segments. In the first, the segments are drawn on 
topographic or cadastral maps following roads, rivers, or 
limits of fields (sometimes called cadastral segments). The 
sample is usually drawn with a two-stage procedure with 
intermediate primary sample units to reduce the burden 
of building the frame (Cotter 1987), which remains in any case 
a heavy operation. 
We generally use the second approach, frames based on a square grid (Gallego 
and Delincé 1994), which is much faster to define. Satellite 
images are generally (but not necessarily) used for strati- 
fication prior to sampling. 

Figure 4 illustrates a small example of this kind of 
sample with a very simple stratification and segments of 
25 ha (hectares). Sampling is systematic, repeating a pat- 
tern in square blocks. In this case the blocks have a size 
of 10 km × 10 km, and the pattern has 4 replicas in the 
most agricultural stratum (plain), 2 replicas in the hills, 
and one in the mountains. 




Figure 4. Example of area frame sample with squared segments 
and squared blocks. 


The main drawback of this approach is the management 
of segments that fall on the boundary between two strata 
(Figure 5). Three alternatives are being tested for this 
problem: (1) adapting the stratification to the sampling 
grid, (2) splitting border segments into pieces that belong 
to different strata, and (3) keeping only the largest one 
among these pieces. 

The most frequent non-sampling errors (shifts in location 
and inaccuracy in the shape or size of the segment) are 
not strongly correlated with land use. No major influence 
has been found on the area estimates or their precision. 

Figure 5. A segment can be split by a stratum boundary. 


The sample pattern to be repeated in each block is 
drawn at random with a restriction on the distance between 
segments in order to avoid segments that are too close to 
each other. Cluster estimators can be used in this case 
rather than standard formulae for random sampling 
(Fuentes 1994, Ambrosio 1993). Systematic sampling has 
a risk of bias if there is a cyclic effect in the landscape with 
a period that coincides with the block size (10 km in the 
example), but this is very unlikely. The distance threshold 
between segments can induce an overestimation of standard 
errors if the spatial correlation is significantly positive for 
distances less than the threshold. 
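
The block-pattern sampling described above can be illustrated with a short sketch. It is not the software used in the MARS Project: the function names, the Euclidean distance rule expressed in grid cells, and the rule that replica j of the pattern is kept only in strata entitled to more than j replicas are all assumptions made for illustration.

```python
import random


def draw_pattern(cells_per_side, n_replicas, min_dist, seed=0):
    """Draw an ordered pattern of grid cells inside one block.

    A candidate cell is accepted only if it lies at least `min_dist`
    cells (Euclidean distance) from every cell already accepted.
    The distance rule is an assumption of this sketch.
    """
    rng = random.Random(seed)
    pattern = []
    attempts = 0
    while len(pattern) < n_replicas:
        attempts += 1
        if attempts > 100_000:
            raise ValueError("distance threshold too strict for this block")
        cand = (rng.randrange(cells_per_side), rng.randrange(cells_per_side))
        if all((cand[0] - r) ** 2 + (cand[1] - c) ** 2 >= min_dist ** 2
               for r, c in pattern):
            pattern.append(cand)
    return pattern


def systematic_sample(n_blocks_x, n_blocks_y, cells_per_side, pattern,
                      stratum_of, replicas_by_stratum):
    """Repeat the pattern in every block.

    Replica j of the pattern is kept in a block only if the stratum of
    the corresponding cell is allotted more than j replicas.
    """
    sample = []
    for bx in range(n_blocks_x):
        for by in range(n_blocks_y):
            for j, (r, c) in enumerate(pattern):
                cell = (bx * cells_per_side + r, by * cells_per_side + c)
                if j < replicas_by_stratum[stratum_of(cell)]:
                    sample.append(cell)
    return sample


if __name__ == "__main__":
    # 10 km blocks of 500 m x 500 m segments (25 ha): 20 cells per side.
    pattern = draw_pattern(cells_per_side=20, n_replicas=4, min_dist=5, seed=1)
    replicas = {"plain": 4, "hills": 2, "mountains": 1}
    # Hypothetical stratification: every cell is in the plain.
    sample = systematic_sample(2, 2, 20, pattern, lambda cell: "plain", replicas)
    print(len(sample), "segments selected")
```

Because the same pattern is repeated in every block, the sampling rate in each stratum is fixed by the number of replicas retained there (4, 2 and 1 per block in the example of Figure 4).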

The size of the segments varies from region to region 
depending on the agricultural landscape, especially on the 
size of fields. In the Czech Republic, the segment size 
was 400 ha. For the area survey, enumerators locate the 
segments, draw fields on a transparent sheet placed over 
an aerial photograph, and write down their land use. 
About 5% to 10% of the segments are visited again by 
supervisors to check for possible errors on the ground 
work. Satellite images are not used either for the survey 
itself or for the farm survey, but they can be optionally 
used to improve the precision of the area estimates as 
described in the next section. 




2.4.2 Improving Area Estimates with Satellite Images 


High resolution satellite images from Landsat-TM or 
SPOT-XS sensors have been assessed and are still being used 
at moderate scale to improve the estimates obtained from 
the ground survey on a sample of segments. The most com- 
monly used approach is the regression estimator on classified 
images. An alternative estimator based on confusion matrices 
has been tested with results that are very close to those of 
the regression estimator (Hay 1988, Gallego 1994). 

The conclusions of this assessment are similar to those 
of the US Department of Agriculture (Allen 1990). The 
use of satellite images for area estimation is operational, 
but still too expensive for the efficiency obtained. The 
economic threshold can be reached by improving image 
processing automation, since the cost of image processing 
in the European market for this purpose is much higher 
than the cost of the images themselves. This threshold has 
nearly been reached with Landsat-TM images in Greece. 
Different conclusions on cost analysis are presented by 
Giovacchini (1992). 


3. SAMPLING FARMS BY POINTS 


For agricultural surveys in the European Community, 
farms are traditionally sampled from a list frame (Eurostat 
1991). The list is a census of farms that exceed a certain 
size threshold. In many countries an agricultural census 
is made every 10 years and is seldom updated (if ever). 
Hence there may be a substantial difference between the 
sampling frame and the actual population at the date of 
the survey. The situation is worse in the central European 
countries of the former Eastern Bloc (the area between 
Poland and Rumania-Bulgaria), where the change of land 
property structure is so rapid that the census may not exist 
for private farms and becomes obsolete for co-operatives. 

Area frames on square segments can be easily defined 
when the geographic borders of the region are known. A 
subsample of these segments is used as well for sampling 
farms in several countries with the help of a template of 
points overlaid on the segment. This has been experimen- 
tally tested in Germany, Portugal, Italy (Carfagna 1991) 
and Spain, and is now being regularly used in Greece, 
Rumania and the Czech Republic. 

The template is the same for all the segments in a 
stratum, and usually symmetric to reduce the risk of bias 
due to a particular geographic location. Data are obtained 
only for farms corresponding to points falling on Utilized 
Agricultural Area (UAA). 

The definition of UAA used in the field work is adapted 
to each national system. Farm buildings and rough pastures 
are included in some countries and excluded in other 
countries. The crucial point is that the definition used must 
be consistent with the definition of the column UAA used 
for computation (Table 1). 


Table 1 


Observations Generated by Points Sampled 
in the Segment of Figure 6 


Pp Wheat Barley 
erma- 
Segment Point UAA oh a Dac " Pee 
P pom tion ae tion 

1 1 19 4 2, 64 0 0 

1 2 (0) 0) 0 (0) 0 0 

1 3 (0) 0 0 0 0 0 

1 4 65) 0 24 131 3 12 

1 5 35) 0) 24 131 3 12 

2 


In the example of Figure 6, point 3 fell on woodland and 
point 2 on a built area. They will generate two zero-valued 
records in the farm file. The enumerator will have to locate 
the farmers for the other three points. The farm correspon- 
ding to point 1 has other fields in the segment, that will 
be implicitly included in the survey, but the enumerator 
will not need to find out if these fields exist. Points 4 and 5 
belong to the same farm, and it will appear twice in the 
farm file (Table 1). 


Figure 6. Segment with a pattern of 5 points for farm sampling. 


Farmers are located and asked to provide global data 
for the farm, including total area and production of each 
target crop. No question is asked about the production of 
each field or the set of fields inside the segment. This is 
not necessary because in the final formulae to compute the 
estimates (formulae 2 and 3 in section 4.1) the crop area 
or the production in the tract is not used. 
The ground survey instructions are usually transferred 
from the JRC to National Administrations, which explain 
them to regional co-ordinators, who in turn give the 
information to the enumerators. Instructions may be 
modified in some of these steps. Checking that the instruc- 
tions have not been misunderstood is sometimes difficult, 
in part because linguistic limitations are a serious barrier 
to direct contact with enumerators. In some countries 
(e.g., Spain) farmers live mainly in rather large urban 
nuclei and are difficult to locate; this can lead to a signifi- 
cant amount of missing data. 


4. ESTIMATES BASED ON FARMS 
SAMPLED BY POINTS 


We assume that the population $\Omega$ of segments is divided 
into strata $\Omega_h$, $h = 1, \ldots, H$; the total population size is 
$N$ segments ($N_h$ for stratum $\Omega_h$) and the sample size is $n$ 
segments ($n_h$). The size of our sample of points in each 
segment will be $K_i$, previously fixed; in general we have 
$K_i = K$, constant across all strata, out of which $F_i$ 
correspond to the farms on which these points fall. Each 
segment $i$ has a total UAA surface $U_i$. 

We have a two-stage sampling scheme. In the first 
stage the segment $i$ is selected with probability $p_i = 1/N_h$ 
in each of the $n_h$ trials. In the second stage the unit is not 
the farm but the tract (the UAA in a segment that belongs 
to the same farm). The tract $k$ of segment $i$ has an area 
$T_{ik}$. The total UAA of the farm is $A_{ik}$ over all segments. 
$U_i$ is the sum of the tracts $T_{ik}$ in segment $i$. 

The method presented below is closely related to the so 
called ‘‘weighted segment estimator’’ approach used in the 
U.S. and in Canada (Nealon 1984). 


4.1 Estimates Based on Farms and Non-Farm Points 


There will be $K_i - F_i$ observations (fictitious farms) 
with value 0 corresponding to points outside the UAA. 

Sampling through points means that tracts are selected 
with replacement and with a probability $p_{ik} = T_{ik}/D_i$ proportional 
to the tract area (the knowledge of $T_{ik}$ is not necessary), 
where $D_i$ is the size of the segment determined by the 
frame design. We are implicitly assuming that the surveyed 
region is flat. A slight bias might be introduced by the fact 
that annual crops are usually on more or less flat land and 
pastures or non-UAA are often on land with a steeper 
slope. 

The sampling is done with replacement: a farm can be 
selected more than once, which gives easier formulae for 
variance estimation. Strictly speaking, the joint selection 
probability that farms $k$ and $k'$ are in the sample is 
$p_{ikk'} \neq p_{ik} \times p_{ik'}$; equality would hold if the different points of 
the template were drawn independently, but there is 
usually a relatively large distance between them. We will 
disregard this fact in this paper. 
$W_{ik}$ will be an additive quantity for a farm, most often 
the production or the area of a particular crop. It is 
obvious that yield is not an additive variable. 

Since we have no information about how $W_{ik}$ is distrib- 
uted inside the farm, we create a fictitious variable $X$ that 
is uniformly distributed, and that has, by definition, the 
same total as $W$ for each farm: 


$$X_{ik} = \frac{T_{ik}}{A_{ik}}\,W_{ik}. \tag{1}$$

Estimating the totals of X and W are equivalent 
problems. 


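The two-stage version of the Horvitz-Thompson esti- 
mator for the total of $X$ in the stratum $\Omega_h$ gives: 


$$\hat{X}_h = \frac{N_h}{n_h}\sum_{i=1}^{n_h}\hat{X}_i = \frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{1}{K_i}\sum_{k=1}^{K_i}\frac{D_i\,X_{ik}}{T_{ik}} = \frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{D_i}{K_i}\sum_{k=1}^{K_i}\frac{W_{ik}}{A_{ik}}. \tag{2}$$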

This means that, even if the second stage sampling unit 
is the tract, we do not need to know its area nor $X_{ik}$, but 
just the global information about the farm. 

The estimator is a linear function of the estimates on 
the selected segments. Its variance in stratum $\Omega_h$ can be 
estimated as (Cochran 1977, section 11.6): 


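$$\hat{V}(\hat{X}_h) = \frac{N_h^2}{n_h}\Big(1 - \frac{n_h}{N_h}\Big)\frac{1}{n_h - 1}\sum_{i=1}^{n_h}\Big(\hat{X}_i - \frac{1}{n_h}\sum_{j=1}^{n_h}\hat{X}_j\Big)^2 + \frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{D_i^2}{K_i(K_i - 1)}\sum_{k=1}^{K_i}\Big(\frac{W_{ik}}{A_{ik}} - \frac{1}{K_i}\sum_{k'=1}^{K_i}\frac{W_{ik'}}{A_{ik'}}\Big)^2, \tag{3}$$

where $\hat{X}_i = (D_i/K_i)\sum_{k} W_{ik}/A_{ik}$ is the estimated total for segment $i$, and 


$$\hat{V}(\hat{X}) = \sum_{h=1}^{H}\hat{V}(\hat{X}_h). \tag{4}$$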
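
As an illustration of how the stratum estimate can be computed from farm-level answers alone, here is a minimal sketch. It assumes a common segment size D and a common number of points K per segment, and it uses the textbook two-stage variance estimator (between-segment term with a finite population correction, plus a with-replacement within-segment term); the function name and the data layout are hypothetical and are not taken from the program mentioned in the text.

```python
def stratum_estimate(points, N_h, D, K):
    """Estimate a stratum total and its variance from point-sampled farms.

    points: dict mapping segment id -> list of K pairs (W, A), one per
            template point. W is the farm total of the variable and A the
            total UAA of the farm. Points outside the UAA are (0.0, None).
    N_h:    number of segments in the stratum.
    D:      segment size in the same unit as A (assumed equal for all segments).
    K:      number of template points per segment.
    """
    n_h = len(points)
    if n_h < 2 or K < 2:
        raise ValueError("need at least 2 segments and 2 points per segment")

    seg_totals = []   # estimated total X_i for each sampled segment
    within = 0.0      # sum over segments of the within-segment variance terms
    for obs in points.values():
        if len(obs) != K:
            raise ValueError("every segment must have exactly K points")
        # Ratio W/A for each point; non-UAA points contribute 0.
        z = [w / a if a else 0.0 for w, a in obs]
        seg_totals.append(D / K * sum(z))
        z_bar = sum(z) / K
        within += D ** 2 / (K * (K - 1)) * sum((v - z_bar) ** 2 for v in z)

    total = N_h / n_h * sum(seg_totals)
    mean = sum(seg_totals) / n_h
    s2_between = sum((t - mean) ** 2 for t in seg_totals) / (n_h - 1)
    variance = (N_h ** 2 * (1 - n_h / N_h) * s2_between / n_h
                + N_h / n_h * within)
    return total, variance


if __name__ == "__main__":
    # Hypothetical data: 2 segments of 25 ha, 5 points each, 100 segments in the stratum.
    pts = {
        1: [(10.0, 20.0), (0.0, None), (0.0, None), (20.0, 40.0), (20.0, 40.0)],
        2: [(10.0, 40.0)] * 5,
    }
    print(stratum_estimate(pts, N_h=100, D=25.0, K=5))
```

Note that a farm hit by two points simply appears twice in the list for its segment, and that neither the tract area nor the within-segment crop area is needed.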


Crop areas are currently estimated from the segment 
survey with more objective ground data (direct observation 
of the enumerator in the field), although some bias can 
appear due to the imperfect location of the segments on 
the ground. Farm surveys provide both area and produc- 
tion estimates, but they can have more significant bias due 
to non response and to a subjective tendency of the farmer 
that can depend on whether he is more concerned about 
taxes or about subsidies at the time of the survey. Com- 
paring both area estimates, from segment survey and farm 
survey, can be useful to check for possible bias on the 
production estimate based on the farm survey. 
The estimates are also possible for cattle, but the results 
will presumably be poor if there are a substantial number 
of farms without any UAA, which will not be sampled: 
the coverage of the area frame will not be complete in this 
case. On the other hand it may happen that the number 
of livestock does not correlate to the UAA and hence to 
the probability of selection. This results in inefficient 
estimates. 

A program in C for Personal Computers has been 
written (Dicorato 1993) to compute estimates using this 
method. The main part of the program was first written 
to compute estimates on a segment survey. 


4.2 Estimation Based Only on Farm Points 


We shall mention another option that consists of using 
only points that fall in the UAA. In this case, we first fix 
$F_i$, the number of points that fall in UAA (often $F_i = F_h$, 
constant in each stratum). In segment $i$ we observe as many 
points as necessary to have $F_i$ points in the UAA. If the 
segment $i$ has no UAA, one observation (fictitious farm) 
is added with 0 values. This is actually an implicit second-stage 
stratification, or a stratification of the first-stage units 
(segments) into two strata: UAA and non-UAA. The non-UAA 
stratum is not sampled. In this case (2) and (3) are 
to be adapted substituting $K_i$ by $F_i$ and $D_i$ by $U_i$. Some 
inconsistency may arise in hilly areas because $A_{ik}$ comes 
from the farmer’s declaration and $U_i$ from segments 
drawn on the ground over aerial photographs. 


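$$\hat{X}_h = \frac{N_h}{n_h}\sum_{i=1}^{n_h}\hat{X}_i = \frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{U_i}{F_i}\sum_{k=1}^{F_i}\frac{W_{ik}}{A_{ik}}, \tag{5}$$


$$\hat{V}(\hat{X}_h) = \frac{N_h^2}{n_h}\Big(1 - \frac{n_h}{N_h}\Big)\frac{1}{n_h - 1}\sum_{i=1}^{n_h}\Big(\hat{X}_i - \frac{1}{n_h}\sum_{j=1}^{n_h}\hat{X}_j\Big)^2 + \frac{N_h}{n_h}\sum_{i=1}^{n_h}\frac{U_i^2}{F_i(F_i - 1)}\sum_{k=1}^{F_i}\Big(\frac{W_{ik}}{A_{ik}} - \frac{1}{F_i}\sum_{k'=1}^{F_i}\frac{W_{ik'}}{A_{ik'}}\Big)^2. \tag{6}$$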


The second term of (6) is null for segments with no UAA. 
This term cannot be computed if $F_i = 1$ because of non- 
response. A value 0 can be attributed, though this will lead 
to an underestimation of the within-segment variance, 
which is relatively small according to calculations made 
on available data (Carfagna 1992). 

This approach has only been used once to resolve a 
misunderstanding of instructions for ground work that 
should have been performed following the method in 
section 4.1. However advantages and drawbacks of both 
approaches are not clear, and no systematic comparison 
has been made so far on the same region and same year. 
Using only farm points can increase the cost of the survey 
if the number of points per segment is to be kept constant, 
but the non-UAA points removed correspond to null 
values of $W_{ik}$, and their removal can result in a reduction 
of the variance. 


4.3 Farms with Fields in Different Strata 


At first sight, the estimator (2) seems to assume that a 
farm k that has been selected through a point in stratum 
$\Omega_h$ is completely included in this stratum. It is obvious 
that a farm can have fields in different strata, and the 
question arises as to whether this fact disturbs the reli- 
ability of the results. 

We stress again that the variable used is not really $W_{ik}$, 
but $X_{ik}$ defined for each individual tract. The total of $W$ 
does not coincide with the total of $X$ in each stratum, but 
it does in the whole region as long as 


$$A_k = \sum_{i \in \Omega} T_{ik}. \tag{7}$$


Notice that $A_k$ is identical to what we have previously 
called $A_{ik}$, where the subindex $i$ is used only to indicate 
that farm $k$ has been selected in the sample through 
segment $i$. 

This identity holds on the population, regardless of the 
sampling procedure, if the farms are entirely inside the 
region and if the geometry of the ground survey document 
(aerial photograph) is correct. 
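
The step from identity (7) to the equality of totals can be written out as follows. This is a short sketch using only definition (1); the symbol $W_k$ for the farm-level value (the same for every tract of farm $k$) is introduced here for illustration.

```latex
\sum_{i \in \Omega} X_{ik}
  = \sum_{i \in \Omega} \frac{T_{ik}}{A_k}\, W_k
  = \frac{W_k}{A_k} \sum_{i \in \Omega} T_{ik}
  = W_k
  \quad \text{whenever } A_k = \sum_{i \in \Omega} T_{ik}.
```

Summing over all farms then gives the same regional total for $X$ and for $W$, even though the stratum totals differ when a farm has tracts in several strata.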

The perturbation due to farms with fields in different 
regions is expected to be small because of the low propor- 
tion (generally under 1-2%) and because there is a com- 
pensation between the bias due to fields inside the region 
belonging to farms with the headquarters outside the 
region and vice versa. We are assuming that the total of 
W is calculated on the farms that have their headquarters 
inside the surveyed region. 


4.4 Nonresponse 


We refer here to the estimators based on farm and non- 
farm points (section 4.1). If a farmer does not co-operate 
or cannot be found, the corresponding row or rows of the 
input table (Table 1) are substituted with the average 
values of responding farms in the segment, if there are any; 
otherwise they are substituted with the average of respon- 
ding farms for all the segments in the current stratum. 

If in the second stage (sampling farms inside the segment) 
we consider farm and non-farm points, and give value 0 
to the points that fall in non agricultural land, it is obvious 
that the exclusion of nonrespondents would produce a 
serious bias, because the zero values corresponding to non- 
UAA are never missing. These points are not used to 
compute the ‘‘average farm’’ values used to fill missing 
values. There is still a risk of bias if farmers who cannot 
be located or refuse to co-operate have a peculiar behaviour, 
e.g., if they are on the average smaller or less efficient farms. 
We could have considered a different way of over- 
coming this problem: eliminating both missing values and 
a proportional number of 0 values corresponding to non- 
UAA points. Both give the same estimate for the total, but 
the second solution is less convenient because the 
sample size in the second stage is no longer an integer. 

The introduction of ‘‘average farm’’ values will lead to 
a negative bias on the variance. To compensate for it, the 
imputed farm is not included in the sample size $K_i$ for the 
computation of variances. 
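
The imputation rule of this section can be sketched as follows. The record layout, the field names and the label 'imputed' are hypothetical; the sketch also assumes that averages are taken over responding points, so a farm hit by two points counts twice, which the text does not specify.

```python
def impute_missing(records):
    """Fill nonrespondent rows with 'average farm' values.

    records: list of dicts with keys
        'segment': segment identifier,
        'kind':    'farm' (respondent), 'non_uaa' (zero-valued point)
                   or 'missing' (nonrespondent),
        'row':     list of numeric values (e.g. UAA, crop area, production),
                   or None for a nonrespondent.
    Non-UAA points are never used to compute the averages.
    """
    def mean_rows(rows):
        # Column-wise mean of a non-empty list of equal-length rows.
        return [sum(col) / len(rows) for col in zip(*rows)]

    responding = [r for r in records if r["kind"] == "farm"]
    if not responding:
        raise ValueError("no responding farm in the stratum")
    stratum_mean = mean_rows([r["row"] for r in responding])

    by_segment = {}
    for r in responding:
        by_segment.setdefault(r["segment"], []).append(r["row"])

    completed = []
    for r in records:
        if r["kind"] == "missing":
            rows = by_segment.get(r["segment"])
            # Segment average if available, otherwise stratum average.
            row = mean_rows(rows) if rows else stratum_mean
            completed.append({**r, "row": row, "kind": "imputed"})
        else:
            completed.append(dict(r))
    return completed


if __name__ == "__main__":
    recs = [
        {"segment": 1, "kind": "farm", "row": [20.0, 10.0]},
        {"segment": 1, "kind": "farm", "row": [40.0, 20.0]},
        {"segment": 1, "kind": "missing", "row": None},
        {"segment": 2, "kind": "non_uaa", "row": [0.0, 0.0]},
        {"segment": 2, "kind": "missing", "row": None},
        {"segment": 3, "kind": "farm", "row": [60.0, 30.0]},
    ]
    for rec in impute_missing(recs):
        print(rec)
```

Keeping a separate label for imputed rows makes it straightforward to leave them out of the sample size when variances are computed.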


5. RESULTS: TWO EXAMPLES 


We discuss below some results from two regions: Emilia 
Romagna (Italy) and the Czech Republic. In the Czech 
Republic, the method presented in sections 2.4, 3 and 4.1 
was used; there were no missing data at all. In Emilia 
Romagna the general design of the survey did not follow 
exactly the procedure outlined above. Missing data were 
treated as stated in section 4.4. 


5.1 Emilia Romagna 1990 


In Emilia Romagna an area of 19,500 km² was divided 
into 4 strata, excluding mountainous areas. A sample of 
313 ‘‘cadastral’’ segments (with physical boundaries) was 
drawn based on a two-stage procedure with primary 
sampling units (psu) of about 10 km. Segment size was 
approximately 50 hectares or 100 hectares, depending on 
the strata. 5 points per segment were drawn at random 
from a grid with a 50 metre step. 

Out of the 1,565 points sampled: 326 were non-UAA, 
the farmer’s address could not be found for 206 UAA 
points, 38 farmers were not located and 32 refused to 
co-operate. 963 UAA points from 285 segments had valid 
data, corresponding to 617 farms, some of which appear 
more than once in the sample. 

When we think only of area estimation, the segment 
survey can be seen as more objective and complete, since 
there are no missing data and observations do not rely on 
farmers’ answers. If we accept this principle we can have 
an idea of a possible bias in the farm survey by comparing 
with the area estimates of the segment survey. Estimates 
can be compared in Table 2 for the main crops in the 
region. Figures match well for cereals, excepting durum 
wheat, and permanent crops, but some problems appear 
for sugar beet and soya, that might be related to misunder- 
standings on how to declare second crops in the same year 
and the same field, or with a bias due to missing values. 
Official statistics are produced taking into account a 
variety of information. Durum wheat is reported sepa- 
rately because of the special meaning of this crop due to 
the significant subsidy granted by the EC to each hectare 
of crop. 
Table 2 


Results of the Segment Survey and the Farm Survey 
for Main Crops in Emilia Romagna (1990) 


Seencnt Farm Survey ISTAT 
Survey 
Emilia Area X Area X Prod. x 
Romagna 1,000 ha 1,000 ha 1,000 tm 
Ss Area 
Esti- CV Esti- CV Esti- CV 
mation % mation % #£mation % 
Soft wheat 212 Sy i 208 619 eel 7 8 212 
Durum wheat 46 14.9 48 15,2 260 14 WW? 
Barley 43 12) 50 Weg 184 38 
Rice - - 4 59.0 23a5 16) 6 
Sugar beet 111 Tele 96 9.6 5,474 28 119 
Soybeans 76 6.0* 55 11.6 SPA oe 47 
Vineyards 78 IIejgete 76 18.7 TS 


Orchards Hi Sie OS 19.7 85 


* Estimate corrected by regression on classified satellite image. 
ISTAT: Official statistics. No precision provided. 


The coefficients of variation in the farm survey have 
a reasonable behaviour for cereals, but become more 
difficult to understand for sugar beet and soybeans. The 
high CV (Coefficient of variation) for the production can 
be due to higher yields in larger, more specialized farms. 

A correction of the production estimate can be made 
using the difference of area estimates between the segment 
survey and the farm survey. A regression estimator 
approach might be a good solution. 

Livestock is seriously underestimated (Table 3) since 
many livestock owners do not have agricultural land. A 
mixed approach was used for cattle and pigs with an 
exhaustive survey using a list frame for the 50 largest farms 
and point sampling for the rest. The procedure works for 
pigs, but CVs are not yet satisfactory. 


Table 3 


Results of the Farm Survey on Area Frame and Mixed 
Frame for Livestock in Emilia Romagna (1990) 

Area Frame Mixed Frame 

x 1,000 Units Census 


Estimate CV % Estimate CV % 


Cattle 869 829 14 894 13 
Pigs 1,876 esi 37 1,818 Di 
Sheep 90 38 74 


5.2 Czech Republic 1992 


Area frames seem especially useful in the former com- 
munist countries in Europe because of the rapid change 
of property structure. Agricultural statistics are mainly 
produced with no sampling error by adding the data 
reported by each state farm or co-operative. This proce- 
dure will collapse in the coming years. It will be extremely 
difficult to have an idea of the number of existing farms, 
and an agricultural census will be out of date before 
the data are processed. Area frames might be the best 
alternative. 

The territory of the Czech Republic (about 80,000 km²) 
has been stratified into 6 strata by photo-interpretation of 
Landsat-TM images. The stratification needed 15 working 
days for one person. In 1992, a survey was made with a 
sample of 417 square segments of 400 ha drawn by repe- 
tition of a fixed pattern on blocks of 40 km × 40 km. 
Segments were visited and area estimates obtained as 
explained in section 2.4.1. 

Farms have been sampled using a fixed grid of 5 points 
in each segment. The 5-point grid was X-shaped, 
as in Figure 6. This procedure gave 2,085 points: 858 non- 
agricultural, and the other 1,227 from 458 farms. No 
missing data were recorded: all the farms were identified 
and none refused to co-operate. This happened mainly 
because the old structure of large farms was still nearly 
intact. 

Table 4 compares the results of the segment survey 
(direct observations on the field), the farm survey (farms 
sampled by points), and official statistics for the main 
crops in the country. Official statistics are obtained by 
adding figures reported by all the state farms or co- 
operatives. There is a moderate disagreement on area 
estimates for wheat, maize, and potatoes. We should not 
exclude a bias in farmers’ answers that has to do with self- 
consumption of agricultural products. 


Table 4 
Results of the Segment Survey and the Farm Survey 
in the Czech Republic (1992) 


Segment 
01/000 ha kao 
Area CV% Area CV% Prod. CV% Area Prod. 


Farm Survey CSO 


Wheat 824 Sty Oy ee 3) 5, See 4 OU OU SKA 


Barley 655 Spl O30) 3.80 6 25a Ate) OO NOD 
Rapeseed TAQ ICG. 375 1628 SO S36 296 
Sugar beete 9 Stes yates 4 LO eS mes) 8r 4: 
Maize 361 7.5 326 4.8 8,884 4.3 361 8,904 


Potatoes 109 13.6 92 1299 15706 Sa SINT 1969 
CSO: Czech Statistical Office. 


The coefficients of variation (CV) of the area estimates 
are lower in the farm survey than in the segment survey. 
This is not surprising since the farm survey gives informa- 
tion about fields outside the segments. The 458 selected 
farms represent more than 15% of the total UAA in the 
country. The CVs for production estimates are slightly 
higher than for area estimates (even lower in the case of 
maize). This seems to indicate that the variability of yields 
contributes less than the variability of areas to the vari- 
ability of production. 


6. CONCLUSIONS AND RECOMMENDATIONS 


Area frames based on square grids are a pragmatic 
alternative to area frames based on ground elements 
delimited by physical features. They are much cheaper to 
build and they do not seem to have major drawbacks 
regarding the final results. However some theoretical work 
is still needed to determine under which conditions the 
location errors due to non-physical limits have a negligible 
effect on the estimates. 

Sampling points inside area segments provides a feasible 
way to build frames for farm sampling. They are extremely 
useful if list frames (census) are poorly updated or do not 
exist. Sampling a few points per segment can be much 
cheaper than surveying all the farms with fields in the 
segment. Five points per segment seems to be a reasonable 
choice. 

Area frames alone give poor results for livestock when 
the number of units is not strongly correlated with Utilized 
Agricultural Area of the farm. 
Use of Capture-Recapture Techniques to Estimate 
Population Size and Population Totals when 
a Complete Frame is Unavailable 


K.H. POLLOCK, S.C. TURNER and C.A. BROWN¹ 


ABSTRACT 


We present a formal model based sampling solution to the problem of estimating list frame size based on capture- 
recapture sampling which has been widely used for animal populations and for adjusting the US census. For two 
incomplete lists it is easy to estimate total frame size using the Lincoln-Petersen estimator. This estimator is model 
based with a key assumption being independence of the two lists. Once an estimator of the population (frame) size 
has been obtained it is possible to obtain an estimator of a population total for some characteristic if a sample of 
units has that characteristic measured. A discussion of the properties of this estimator will be presented. An example 
where the establishments are fishing boats taking part in an ocean fishery off the Atlantic Coast of the United States 
is presented. Estimation of frame size and then population totals using a capture-recapture model is likely to have 
broad application in establishment surveys due to practicality and cost savings, but possible biases due to assumption 
violations need to be considered. 


KEY WORDS: Incomplete frames; Capture-recapture sampling; Angler surveys; Telephone surveys; Access surveys. 


1. INTRODUCTION 


In classical sampling theory it is assumed that a complete 
frame exists. There is, at least conceptually, a complete 
list of population units. It is then possible to draw a prob- 
ability sample from the population. Estimators of popula- 
tion parameters such as mean or total then have known 
properties and are easily studied theoretically or numer- 
ically. Books on sampling theory such as Cochran (1978) 
concentrate on this situation and give properties of esti- 
mators for common sampling designs such as simple 
random sampling, stratified random sampling and multi- 
stage (cluster) sampling. 

In practice in surveys of establishments or businesses 
a complete frame may not exist. Lists of establishments 
kept by professional associations or government agencies 
are often incomplete. One approach to tackling this 
problem is to use the multi-frame approach originally 
developed by Hartley (1962, 1974). Examples of this 
approach are the National Agricultural Statistics Service 
(USDA) farm surveys (Vogel and Kott 1993). These 
surveys use an incomplete list frame of farms plus an area 
frame where all farms within a sample unit are enumerated. 
Therefore the list frame is incomplete while the area frame 
is conceptually complete. (There is a list of all area units 
and within each area unit theoretically all farms could be 
enumerated.) 

There are some situations, however, where it may not 
be possible to use an area frame for practical reasons. All 
that the researcher may have available are several 
incomplete list frames of establishments. The usual 
approach in this situation is to merge all the incomplete 
lists and ignore any remaining incompleteness. Depending 
on the degree of incompleteness remaining, there could be 
serious negative bias in estimates of population size and 
population total. 

Later we present a formal model based sampling solution 
to this problem based on capture-recapture sampling. 
Capture-recapture sampling models are widely used in 
sampling animal populations (Seber 1982) and also for 
adjusting the U.S. census for undercoverage (Feinberg 
1992). In the simplest case of two incomplete lists we 
consider ‘‘marked’’ units to be those which occur on both 
lists and unmarked units to be those which do not occur 
on both lists. It is easy to estimate total frame size using 
the Lincoln-Petersen estimator (Seber 1982, p. 59). This 
estimator is model based with a key assumption being 
independence of the two lists. Once an estimator of the 
population size has been obtained it is possible to obtain 
an estimator of population total for some characteristic 
if a sample of units has that characteristic measured. 

The usual estimator of a population total for simple 
random sampling without replacement is 

    $\hat{Y} = N\bar{y}$,    (1.1) 

where $N$ is known and $\bar{y}$ is the mean of the sample; see, 
for example, Cochran (1978, p. 21). The variance of $\hat{Y}$ is 
given by 

    $\mathrm{Var}(\hat{Y}) = N^2\,\mathrm{Var}(\bar{y})$,    (1.2) 


¹ K.H. Pollock, North Carolina State University, Raleigh, NC 27695; S.C. Turner and C.A. Brown, National Marine Fisheries Service, Miami, FL 33149, U.S.A. 
where 

    $\mathrm{Var}(\bar{y}) = \dfrac{S^2}{n}\left(\dfrac{N-n}{N}\right)$, 

$S^2$ is the population variance and $(N-n)/N$ is called 
the finite population correction factor. The estimator (1.1) 
is also an unbiased estimator of the population total. 
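
As a minimal sketch of (1.1) and (1.2), assuming a simple random sample drawn without replacement from a frame of known size N, the following Python function computes the estimated total and its estimated variance. The function name `srs_total` and the use of the sample variance in place of the unknown population variance S² are illustrative assumptions, not part of the original text.

```python
from statistics import mean, variance


def srs_total(sample, frame_size):
    """Estimate a population total under SRS without replacement.

    Returns (total, variance_of_total) following (1.1) and (1.2).
    Assumption: the sample variance s^2 stands in for the unknown
    population variance S^2.
    """
    n = len(sample)
    if n < 2 or n > frame_size:
        raise ValueError("need 2 <= n <= frame_size")
    y_bar = mean(sample)
    s2 = variance(sample)                  # sample variance, divisor n - 1
    fpc = (frame_size - n) / frame_size    # finite population correction
    var_y_bar = (s2 / n) * fpc
    return frame_size * y_bar, frame_size ** 2 * var_y_bar


if __name__ == "__main__":
    # Hypothetical data: three sampled units from a frame of ten.
    total, var_total = srs_total([2, 4, 6], frame_size=10)
    print(total, var_total)
```

The finite population correction shrinks the variance towards zero as the sample approaches a complete enumeration of the frame.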


Here our estimator is 

    $\hat{Y} = \hat{N}\bar{y}$,    (1.3) 

where $\hat{N}$ is obtained from the capture-recapture method. 
This means the properties of the estimator (1.3) are more 
difficult to evaluate because both $\hat{N}$ and $\bar{y}$ are random 
variables, unlike in estimator (1.1) where $N$ is a known 
quantity. The estimated variance of $\hat{Y}$ here is given by 

    $\mathrm{Var}(\hat{Y}) = \hat{N}^2\,\mathrm{Var}(\bar{y}) + \bar{y}^2\,\mathrm{Var}(\hat{N}) + \mathrm{Var}(\bar{y})\,\mathrm{Var}(\hat{N})$,    (1.4) 

assuming that $\bar{y}$ and $\hat{N}$ are independent and using an exact 
result due to Goodman (1960). The estimator (1.3) is only 
an unbiased estimator if $\hat{N}$ and $\bar{y}$ are unbiased estimators 
of the population size and population mean respectively, 
which is not usually the case in practice. We discuss the 
estimator (1.3) in the large pelagic fishery survey example 
in Section 3. 
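
The three-term form of (1.4) can be checked from the definition of variance. For independent random variables X and Y with finite second moments (generic symbols introduced here for illustration only), the exact variance of the product works out as follows:

```latex
\begin{align*}
\mathrm{Var}(XY)
  &= E[X^2]\,E[Y^2] - (E[X])^2 (E[Y])^2 \\
  &= \bigl(\mathrm{Var}(X) + (E[X])^2\bigr)\bigl(\mathrm{Var}(Y) + (E[Y])^2\bigr)
     - (E[X])^2 (E[Y])^2 \\
  &= (E[X])^2\,\mathrm{Var}(Y) + (E[Y])^2\,\mathrm{Var}(X)
     + \mathrm{Var}(X)\,\mathrm{Var}(Y).
\end{align*}
```

Taking X as the frame size estimate and Y as the sample mean, and replacing the unknown expectations and variances by their estimates, gives the three terms of (1.4).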

The remainder of the paper is structured as follows. In 
Section 2 we review the capture-recapture literature to give 
an overview of the types of models available. In Section 3 
we present an example of a sample survey of fishing boats. 
(We consider a boat analogous to a business establish- 
ment). While this example has some unique features we 
believe it has many features common to other establishment 
surveys. In the final discussion section we summarize the 
strengths and weaknesses of using the capture-recapture 
approach to estimating frame size in establishment 
surveys. Many of our ideas will require further research. 


2. A BRIEF REVIEW OF CAPTURE- 
RECAPTURE MODELS 


It is obviously beyond the scope of this manuscript to 
review the extensive capture-recapture literature. For more 
information we recommend Seber (1982), White et al. 
(1982), Pollock et al. (1990) and Pollock (1991). Pollock 
(1991) is a review paper and a good lead into the literature 
and our treatment in this section follows it very closely. 
The other references are books and monographs for the 
serious reader with more time. 

Here we briefly discuss the Lincoln-Petersen model for 
two samples, more general closed population and open 
population models for more than two samples, and finally 
a method which combines closed and open population 
models in one sampling design. Pollock et al. (1990, p. 9) 
present a flow chart which shows an overview of the 
models and how they relate to each other. 


2.1 The Lincoln-Petersen Model 


This is the oldest, simplest and best known capture- 
recapture model dating back to Laplace, who used it to 
estimate the population size of France. It was first used 
in fisheries by Petersen around the turn of the century. An 
excellent detailed discussion of this model is given by Seber 
(1982, Chapter 3). 

In the original fisheries setting the method can be 
described as follows. A sample of M fish is caught, 
marked, and released. Later a second sample of n fish is 
captured, of which m are marked. An intuitive derivation 
of the estimator follows from equating the proportions 
marked in the sample and the population, 


    $m/n = M/N$,    (2.1) 

which gives 

    $\hat{N} = Mn/m$.    (2.2) 

A modified estimator with less bias in small samples is 
due to Chapman (1951) and is given by 

    $\hat{N}_c = \dfrac{(M+1)(n+1)}{m+1} - 1$.    (2.3) 

An estimate of the variance of $\hat{N}_c$ is given by 

    $\mathrm{Var}(\hat{N}_c) = \dfrac{(M+1)(n+1)(M-m)(n-m)}{(m+1)^2(m+2)}$.    (2.4) 

See, for example, Seber (1982, p. 60). 
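
Equations (2.3) and (2.4) are simple enough to compute directly. The Python sketch below is one way to do so; the function name `chapman_estimate` is hypothetical, and the counts in the usage example are the private-boat figures reported in Section 3.1.3 (M = 335, n = 374, m = 49).

```python
from math import sqrt


def chapman_estimate(M, n, m):
    """Chapman's modified Lincoln-Petersen estimate, equations (2.3)-(2.4).

    M: units marked (on the first list)
    n: size of the second sample
    m: units in the second sample that are marked
    Returns (N_c, variance estimate, standard error).
    """
    if not (0 <= m <= min(M, n)):
        raise ValueError("need 0 <= m <= min(M, n)")
    n_c = (M + 1) * (n + 1) / (m + 1) - 1
    var = (M + 1) * (n + 1) * (M - m) * (n - m) / ((m + 1) ** 2 * (m + 2))
    return n_c, var, sqrt(var)


if __name__ == "__main__":
    # Private-boat counts from Section 3.1.3 of the text.
    n_c, var, se = chapman_estimate(M=335, n=374, m=49)
    print(f"N_c = {n_c:.0f}, Var = {var:.4f}, SE = {se:.2f}")
```

Because the denominator uses m + 1, the estimate stays finite even when no marked units are recaptured, which the unmodified estimator (2.2) does not.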


The crucial assumptions of this model are: 


(a) The population is completely closed to additions and 
deletions, 


(b) all the fish are equally likely to be captured in each 
sample, and 


(c) marks are not lost or overlooked. 


The assumption about closure can be weakened, but 
even for a completely open population, where this esti- 
mator does not apply, a modification of the Lincoln- 
Petersen estimator is used. The assumption of equal 
catchability causes problems in most applications. There 
may just be inherent variability (heterogeneity) in capture 
probabilities of individual animals due to age, sex or other 
factors. There may also be a response to initial capture 
(trap response). In the next section, we consider closed 
population models with more than two samples that allow 
for time variation as well as heterogeneity and trap responses 
in the animals’ capture probabilities. The loss or overlooking 
of marks can be serious. One way to estimate mark loss 
is to use two marks (Seber 1982, p. 94). 


2.2 Closed Population Models 


Closed population models require the assumption that 
no births, deaths, or migration in or out of the population 
occur between sampling periods. Therefore, these models 
are generally used for studies covering relatively short 
periods of time (e.g., trapping every day for 5 consecutive 
days). Capture histories for every animal caught are the 
data needed for obtaining estimates under these models. 
Important early references are Schnabel (1938) and 
Darroch (1958), who considered models that assumed 
equal catchability of animals in each sample. 

A set of models that allow capture probabilities to vary 
due to heterogeneity (h), trap response (b), time variation 
(t) (i.e., capture probability for time i differs from that 
for time j) and all possible two- and three-way combina- 
tions of these factors is now available. The eight models 
[M(o), M(h), M(b), M(bh), M(t), M(th), M(tb), 
M(thb)] were first considered as a set by Pollock (1974) 
and were more fully developed by Otis et al. (1978), White 
et al. (1982), and Pollock and Otto (1983). Otis et al. 
(1978) provided a detailed computer program, CAPTURE, 
for use with their monograph. An updated version provides 
estimates for seven of the eight models and a model 
selection procedure that aids the biologist in choosing a 
model. The model selection procedure is based on a variety 
of goodness-of-fit tests. Recently, Menkins and Anderson 
(1988) have emphasized that the model selection procedure 
is poor for small populations, unless the capture prob- 
abilities are unrealistically high. 


2.3 Open Population Models 


In many capture-recapture studies, it is not possible to 
assume the population is closed to additions and permanent 
deletions. The basic open population model suitable for 
this situation is the Jolly-Seber model (Jolly 1965; Seber 
1965; Seber 1982, p. 196). The Jolly-Seber model allows 
estimation of population size at each sampling time as well 
as estimation of survival rates and birth numbers between 
sampling times. Migration cannot be separated from the 
birth and death processes without additional information. 


The Jolly-Seber model requires the following assumptions: 


(a) Every animal present in the population at a particular 
sampling time has the same probability of capture, 


(b) every marked animal present in the population imme- 
diately after a particular sampling time has the same 
probability of survival until the next sampling time, 
(c) marks are not lost or overlooked, 
(d) all emigration is permanent, and 


(e) all samples are instantaneous, and each release is made 
immediately after the sample. 


Assumptions (a), (c), and (e) were required under the basic 
Lincoln-Petersen model described in Section 2.1. Only 
marked animals are used to estimate survival rates so that, 
strictly, we do not need to assume equality of marked and 
unmarked survival rates. In practice however, the biologist 
will want to use the survival rate estimates to refer to the 
whole population. The Jolly-Seber model allows for some 
animals to be lost on capture and hence not returned to 
the population. The Jolly-Seber model also requires that 
all emigration is permanent. If animals emigrate and then 
return to the population, this causes so-called temporary 
emigration, which is a serious assumption violation and 
causes major bias in population size estimates. 


2.4 Combination of Closed and Open Models 


Pollock (1982), Pollock et al. (1990) and Kendall (1992) 
discuss sampling methods which allow the use of closed 
and open models in one design. One advantage of these 
methods is that it is possible to allow for unequal catch- 
ability whereas in the traditional Jolly-Seber model it is 
not possible to allow for unequal catchability. They also 
have the advantage of allowing for temporary emigration 
of animals. 


2.5 Applications of Capture-Recapture Models 


Capture-recapture models have obviously been widely 
applied to wildlife and fishery populations. A variety of 
novel nonbiological applications of capture-recapture 
methods have also now appeared. Many authors have 
applied capture-recapture to estimating the census under- 
count. (See Feinberg (1992) for a complete bibliography). 
Cowan, Breakey, and Fischer (1986) used it to estimate the 
number of homeless people in a city. Greene (1983) has 
used the method to estimate demographic parameters on 
criminal populations. Wittes (1974) and Wittes, Colton, 
and Sidel (1974) have used capture-recapture to estimate 
numbers of people with illnesses from hospital and other 
lists. The sampling of elusive human populations using cluster 
sampling, network sampling, and capture-recapture sampling 
was discussed by Sudman, Sirken and Cowan (1988). 


3. USE OF CAPTURE-RECAPTURE MODELS 
IN THE LARGE PELAGIC SURVEY 


The Large Pelagic survey is an angler survey conducted 
by the National Marine Fisheries Service using a telephone- 
access survey design. A sample of fishing boat owners on 
a list is telephoned to obtain fishing effort (i.e., number 
of fishing trips in a period) information. Catch per unit 
effort (i.e., catch per trip) information is obtained from 
a second sample of boat owners at access points at com- 
pletion of their fishing trips. The information from the two 
surveys is combined to estimate total effort and total catch 
of important species such as Bluefin Tuna. 

A serious problem with this survey is that the list of boat 
owners used in the telephone survey is very incomplete. 
Therefore, classical sampling theory which assumes a 
complete frame of known size (N) is inadequate and has 
to be modified. The current method of estimating the size 
of the fishing boat list frame involves combining two lists 
(a telephone list and a dockside list) and using the Lincoln- 
Petersen model. There are questions about whether this 
is the best approach. For example, it might be possible 
to combine more than two lists and if so then we could 
use the closed or open population models reviewed in 
Sections 2.2 and 2.3. However, we defer those questions 
and begin by reviewing and evaluating the current method 
as an example to illustrate the potential usefulness of the 
approach to other establishment surveys. 


3.1 The Lincoln-Petersen Model 


3.1.1 Estimation of Frame Size (N) 


Under the current method the ‘‘marked’’ boats (M) are 
those on the master list, which is primarily derived from 
previous telephone interviews. The recapture sample is 
carried out dockside at gas pumps and the total number 
of boats intercepted (n) is checked to see which ones are 
‘‘marked’’ (m) (i.e., on the original master list). Equa- 
tion (2.3) can then be used to provide an estimator of the 
frame size (N). Let us now consider the assumptions of 
this model and what effect violations might have on the 
bias of the estimator of N. 


Closure 


This assumption is likely to be violated. Fishing boats 
may be on the master list and then no longer take part in 
the fishery (losses). New fishing boats may join the fishery 
while it is in progress (gains). Ideally a separate estimate 
of frame size should be obtained for each two week time 
period. The advantage of using the Lincoln-Petersen 
closed model estimator is its simplicity and practicality. 
Biases in the estimator due to lack of closure could be 
either positive or negative. 

Currently it is not known how the fishing fleet size is 
likely to change during the fishing season. A multiple 
capture-recapture sampling design would allow use of the 
Jolly-Seber model to estimate the fleet size during each 
period. Examination of these estimators and the survival 
rate and recruitment number estimators will enable us to 
evaluate the validity of the closure assumption. At the 
moment we can only make conjectures. 


Equal Catchability 


Violation of the assumption of equal catchability may 
be due to either inherent heterogeneity of capture prob- 
abilities between individuals or ‘‘trap response’’ where 
individuals that are marked have higher or lower capture 
probabilities than unmarked individuals. In either situation 
when the individuals on the lists are fishing boats we 
believe there is a potential for heterogeneity of capture 
probabilities among fishing boats. If heterogeneity is 
operating across both samples, individuals ‘‘caught’’ on 
the first list will tend to be those with high capture prob- 
abilities and therefore they will be more likely to be ‘‘caught’’ 
again on the second list. This means that the proportion 
marked in the second sample (list) will be too high and the 
estimator of N will be negatively biased. Note that this 
intuitive argument makes clear it is not heterogeneity per 
se which is the problem but the positive correlation of 
capture probabilities between the two samples. Another 
way of stating the equal catchability assumption is that 
capture probabilities in the two samples are independent. 
One method of attempting to achieve independence of the 
capture probabilities in the two samples is to use totally 
different sampling schemes for the two samples. This is 
why we recommended earlier that one sample list be based 
on the telephone interviews and the other on dockside 
interviews. However, we do suspect that there is still 
another heterogeneity and lack of independence in capture 
probabilities. We believe that fishing boats which take a 
very active part in the fishery are more likely to be on any 
lists gathered (telephone or dockside). This heterogeneity 
will cause a negative bias on the estimate of frame size but 
we have no idea of the degree of this negative bias. A more 
complete discussion of heterogeneity and independence of 
samples is given by Seber (1982, p. 86). 
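
The direction of the bias described above can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions rather than a method from the text: the fleet size of 1,000, the capture probabilities, and the function names are all hypothetical. Each boat keeps the same capture probability for both lists, so boats that are easy to ‘‘catch’’ once are easy to ‘‘catch’’ again.

```python
import random


def chapman(M, n, m):
    """Chapman's estimator (2.3) of population size."""
    return (M + 1) * (n + 1) / (m + 1) - 1


def mean_estimate(capture_probs, reps=200, seed=1):
    """Average Chapman estimate over repeated two-list experiments.

    capture_probs[i] is the chance that unit i appears on a list. The same
    value is used for both lists, so capture probabilities are positively
    correlated across lists whenever they differ between units.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        first = [rng.random() < p for p in capture_probs]
        second = [rng.random() < p for p in capture_probs]
        M = sum(first)                                   # on the first list
        n = sum(second)                                  # on the second list
        m = sum(a and b for a, b in zip(first, second))  # on both lists
        total += chapman(M, n, m)
    return total / reps


if __name__ == "__main__":
    N = 1000  # hypothetical true fleet size
    equal = [0.35] * N                            # equal catchability
    mixed = [0.6] * (N // 2) + [0.1] * (N // 2)   # active vs. inactive boats
    print("equal catchability:", round(mean_estimate(equal)))
    print("heterogeneous:     ", round(mean_estimate(mixed)))
```

Both scenarios have the same average capture probability, so any gap between the two averages comes from the heterogeneity alone. In this setup the heterogeneous fleet should be underestimated by a wide margin.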


Marks Lost or Overlooked 


The situation here is a little confusing. At first one 
might think that in this application there is not a way that 
a mark could be lost or overlooked. However, this assumes 
that all boats have distinct names or that if boats do have 
the same name there is additional information like captain’s 
name which makes all individuals on the lists unique. If 
there is any problem with lack of uniqueness it may not 
be clear whether a marked boat has been recaptured or 
not. Another related point is that agents may make errors 
in the records which make it hard to match up a recapture 
with the original record. A standard operating procedure 
is being developed and documented to minimize these 
kinds of errors in the future. 


3.1.2 Estimation of Total Effort and Total Catch 


Total Effort ($E$) (i.e., the total number of fishing trips 
taken in a defined period) is estimated by 

    $\hat{E} = \hat{N}\bar{e}$,    (3.1) 

where $\hat{N}$ is the frame size (fleet size) estimate and $\bar{e}$ is the 
mean fishing effort (i.e., average number of fishing trips 
taken) obtained from the telephone sample. The evaluation 
of the properties of this estimator is more difficult than 
when $N$ is known because both $\hat{N}$ and $\bar{e}$ are random 
variables. We suspect that $\bar{e}$ is biased high because fishing 
boats that do not fish much are less likely to be on the list. 
Unfortunately we cannot say that $\hat{N}$ will always be biased 
high or low. All three of the assumption violations dis- 
cussed in Section 3.1.1 could be important (closure, heterogeneity, 
and mark loss) and it is not clear what direction the overall 
bias on $\hat{N}$ would take. The only possible approach is to 
use simulation with a variety of different scenarios for 
assumption violations. Using equation (1.4) the estimated 
variance of $\hat{E}$ is given by 

    $\mathrm{Var}(\hat{E}) = \hat{N}^2\,\mathrm{Var}(\bar{e}) + \bar{e}^2\,\mathrm{Var}(\hat{N}) + \mathrm{Var}(\bar{e})\,\mathrm{Var}(\hat{N})$.    (3.2) 

Total catch ($C$) is estimated by $\hat{C} = \hat{E}\bar{c}$, where $\hat{E}$ is the 
estimated total fishing effort and $\bar{c}$ is the average catch 
per unit effort calculated from the dockside interviews. 
Properties of this estimator are likely to be subject to 
similar concerns as equation (3.1) and again simulation 
could be very useful. 


3.1.3 Illustration of the Method 


In this section we present the frame size estimates and 
total effort estimates for the Virginia Bluefin tuna fishery 
in part of 1992. These estimates are a part of a larger 
survey which covered the east coast of the U.S. from North 
Carolina to Massachusetts. The estimates are separate for 
charter boats and private boats. 


Frame Size Estimates 


Lists of unique private boats and charter boats were 
compiled mainly by telephone interviews from previous 
seasons. During the current 1992 season ‘‘marked’’ and 
‘‘unmarked’’ boats were captured at gas pumps before or 
after fishing trips. 

For private boats the list size was M = 335 boats before 
the season. A sample of n = 374 boats was contacted at 
gas pumps and of those m = 49 were marked. The 
Chapman estimator is $\hat{N}_c$ = 2,519 with SE($\hat{N}_c$) = 303.08 
and relative SE = 0.12. 

For charter boats the list size was M = 47 before the 
season. A sample of n = 31 boats was contacted at gas 
pumps and of those m = 13 were marked. The Chapman 
estimator is $\hat{N}_c$ = 109 with SE($\hat{N}_c$) = 17.88 and rela- 
tive SE = 0.16. 


Total Effort Estimates 


Total effort and total catch were estimated in weekly 
waves. Here we just illustrate the calculations for the week 
of the 8th to the 14th of June 1992 for total effort. 
Total Effort - Private Boats 

$\hat{N}_c$ = 2,519 boats, Var($\hat{N}_c$) = 91,856.4706, $\bar{e}$ = 0.15108 
trips per interview, Var($\bar{e}$) = 0.001242 and SE($\bar{e}$) = 
0.0352. Using these estimates we obtain 

    $\hat{E} = \hat{N}_c \times \bar{e} = 2{,}519 \times 0.15108 = 380.57$ trips, 

    $\mathrm{Var}(\hat{E}) = \mathrm{Var}(\bar{e})\,\hat{N}_c^2 + \mathrm{Var}(\hat{N}_c)\,\bar{e}^2 + \mathrm{Var}(\hat{N}_c)\,\mathrm{Var}(\bar{e}) = 10{,}091.6633$, and 

    $\mathrm{SE}(\hat{E}) = 100.45$. 


It is useful to also calculate the variance of total effort 
assuming that the frame size were known. In this case it 
is Var($\hat{E}$) = $\hat{N}_c^2$ Var($\bar{e}$) = 7,880.9384 with SE($\hat{E}$) = 88.77 and this 
shows that 89% of the standard error of the Total Effort 
estimate is due to variation in average effort and only 11% 
is due to estimation of frame size. 
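
The private-boat calculation can be reproduced in a few lines of Python. This is a sketch for checking the arithmetic only; the helper name `product_variance` is hypothetical, and the inputs are the values reported in the text for the week of 8 to 14 June 1992.

```python
from math import sqrt


def product_variance(a, var_a, b, var_b):
    """Variance of a product of two independent estimates, as in (1.4)."""
    return a ** 2 * var_b + b ** 2 * var_a + var_a * var_b


# Values reported in the text for private boats.
n_c, var_n_c = 2519, 91856.4706      # frame size estimate and its variance
e_bar, var_e_bar = 0.15108, 0.001242  # mean effort and its variance

effort = n_c * e_bar
var_effort = product_variance(n_c, var_n_c, e_bar, var_e_bar)
var_known_n = n_c ** 2 * var_e_bar   # frame size treated as known

print(f"E = {effort:.2f} trips")
print(f"Var(E) = {var_effort:.4f}, SE(E) = {sqrt(var_effort):.2f}")
print(f"Var(E | N known) = {var_known_n:.4f}, SE = {sqrt(var_known_n):.2f}")
```

With these inputs the known-frame-size term alone evaluates to about 7,880.94, whose square root is the standard error of 88.77 quoted above; the remaining two terms of (3.2) account for the rest of the total variance.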


Total Effort - Charter Boats 

For charter boats $\hat{E}$ = 59.95 trips with Var($\hat{E}$) = 
512.5100 and SE($\hat{E}$) = 22.64. 

The variance of the Total Effort estimate assuming the 
frame size is known is Var($\hat{E}$) = 404.8926 with SE($\hat{E}$) = 
20.12. Again 89% of the standard error of the Total Effort 
estimate is due to variation in average effort and only 11% 
is due to estimation of frame size. 


3.2 More Than Two Lists 


In Section 2 we indicated that there are a lot more 
modeling possibilities if one has multiple (greater than 2) 
lists. Here we consider closed and open population models 
for the more general case. We foresee the sampling scheme 
as follows. Before the start of the fishing season there 
would be a preliminary sample to establish a list (either 
telephone or dockside). During each time period (say two 
weeks) there would be an additional list compiled using 
a telephone or dockside survey. Now each individual boat 
would have a capture history which would indicate which 
lists it appeared on. (Suppose we have five time periods 
then a capture history of 11101 would indicate a boat 
appeared on the lists in all except the fourth time period). 
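
A capture history of this kind is easy to build once the lists for each period are available. The following Python sketch assumes that boats can be matched across lists by a unique identifier; the function name `capture_histories` and the boat names are hypothetical.

```python
def capture_histories(lists):
    """Build a 0/1 capture history string for every boat seen at least once.

    lists: one collection of boat identifiers per time period, in order.
    Returns a dict mapping identifier -> history, e.g. '11101'.
    """
    periods = [set(period) for period in lists]
    boats = set().union(*periods)
    return {
        boat: "".join("1" if boat in period else "0" for period in periods)
        for boat in sorted(boats)
    }


if __name__ == "__main__":
    # Hypothetical boat names over five time periods.
    lists = [
        {"Osprey", "Marlin"},
        {"Osprey"},
        {"Osprey", "Gannet"},
        {"Marlin"},
        {"Osprey", "Gannet"},
    ]
    for boat, history in capture_histories(lists).items():
        print(boat, history)
```

Boats that never appear on any list have the all-zero history and are not observed; estimating how many such boats exist is the purpose of the capture-recapture models.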

The structure of the sample and the population would 
therefore be as in Table 1. The first question that has to be 
addressed is whether we need to use closed or open popula- 
tion models. The obvious way to proceed is to fit the Jolly- 
Seber open population model first and use it to evaluate 
the closure assumption. 


Table 1 
Structure of the Population Under an Open Population Model* 

                           Pre-Season   Lists (e.g., every two weeks) 
Period                     0            1      2      3      ...    k 
Marked Population Sizes    M_0          M_1    M_2    M_3    ...    M_k 
Total Population Sizes     N_0          N_1    N_2    N_3    ...    N_k 

* Marked and Total Population Sizes are shown for the whole study. 




3.2.1 Open Population Models 


Under the Jolly-Seber model previously discussed in 
Section 2.3 the following parameters are identifiable 
(Table 2). Notice that it is possible to estimate the number 
of fishing boats in the fleet at each time in the season 
except the last (i.e., $N_k$ cannot be estimated). One advan- 
tage of applying the model in this fashion with a preseason 
list is that any concerns with the preseason list due to it 
being out of date are taken care of by the model allowing 
for additions and deletions before the season begins. One 
disadvantage of the Jolly-Seber Model is increased com- 
plexity. Now each time period has its own frame size and 
there are also survival and recruitment parameters to 
estimate. Sometimes these parameter estimates have poor 
precision unless sample sizes are large. Another disadvan- 
tage of the Jolly-Seber model is that it does require the 
assumption of equal catchability. 


Table 2 
Structure of the Jolly-Seber Open Population Model* 

                     Preseason    Season 
Period               0            1      2      3      ...              k-1        k 
Marked Population    (M_0 = 0)    M_1    M_2    M_3    ...              M_{k-1}    - 
Total Population     -            N_1    N_2    N_3    ...              N_{k-1}    - 
Survival Rate        φ_0          φ_1    φ_2           ...  φ_{k-2}     - 
Recruitment No.                   B_1    B_2           ...  B_{k-2} 

* Identifiable parameter estimators are shown for Marked Population Sizes, 
  Total Population Size, Survival Rate and Recruitment Number. 


Another important question about the use of the Jolly- 
Seber model is what is called ‘‘temporary emigration.’’ A 
fishing boat might leave the fishery for some periods and 
then return. The Jolly-Seber model makes the assumption 
that fishing boats which leave do not return. This issue 
needs further investigation. Use of the robust design (i.e., 
combination closed and open models) allows for temporary 
emigration. This would necessitate having two lists obtained 
close together in each period. 


3.2.2 Closed Population Models 


If the Jolly-Seber model estimates of ‘‘survival’’ and 
‘‘recruitment’’ suggest population closure (i.e., N constant) 
then the general closed population models reviewed in 
Section 2.2 could be applied. The advantages are increased 
precision of $\hat{N}$ due to the use of more lists and increased 
robustness of $\hat{N}$ to unequal catchability. The disadvantage 
is primarily an increase in complexity. 


4. DISCUSSION 


4.1 Methods of Dealing with Incomplete List Frames 


(i) Complete the List Frame 


The advantage is that the survey researcher has a com- 
plete frame and does not have to generalize results for an 
estimated frame size. The disadvantage is the cost and 
possible impracticality of completing the list frame. 


(ii) Use an Area Frame 


The advantage is that one only has to enumerate the 
establishments in the areas to be sampled. The disadvan- 
tage is possible inefficiency if businesses are sparse in each 
large area. 


(iii) Using List and Area Frame (Multi-Frame Approach) 


The advantages are obviously increased precision and 
having all establishments covered. The disadvantage could 
be expense and impracticality. 


(iv) Use of Capture-Recapture to Estimate List Frame Size 


The advantage is having a practical method of lower 
expense than the first three approaches listed above. The 
disadvantages are potential bias if the assumptions of the 
capture-recapture method are violated and having to 
include variation due to frame size estimation in variance 
estimates of population total estimates. 


4.2 Capture-Recapture Estimation of Frame Size 


In this section we consider model assumptions, precision 
of estimates, estimation of population totals and the 
special problems in more complex sampling designs when 
the capture-recapture approach to frame size estimation 
is used. 


Model Assumptions 


(i) Closure 


Can the frame size be considered constant so that the 
closed population models can be used? This will depend on 
whether the survey is just a snapshot at a single time point 
or whether a series of surveys over time are required. It 
will also depend on how quickly establishments go out of 
business and how quickly new ones arise. We suspect there 
will be the need for use of closed and open population 
models depending on the establishments being studied. 

There is also the question of temporary emigration 
where establishments go out of the frame and then come 
back in again. This was considered a potential problem in 
the fishing boat example because boats could go inactive 
and then become active again. This may also be a problem 
in some other establishment surveys if establishments go 
in and out of business frequently and keep the same name 
when they come back into business. 
(ii) ‘‘Unequal Catchability’’ and Independence of Lists 


As we discussed earlier ideally the lists used should 
be independent so that the estimates of frame size are 
unbiased. In practice it may not be easy to find two or 
more independent lists. 


(iii) Mark Loss-Unique Identification of Establishment 


Establishment names need to be unique and unmis- 
takable or matches on different lists may be missed or 
mistaken. This was a problem in the fishing boat example 
in earlier years. We suspect this will not be such a big 
problem in most establishment surveys. 


Precision of Estimates 


The lists used need to be of sufficient size that the 
precision of the frame size estimate ($\hat{N}$) is adequate. Seber 
(1982, p. 96) discusses the Lincoln-Petersen estimate in 
detail and presents graphics of sample sizes required for 
various levels of precision. Pollock et al. (1990) present 
sample size information for the open population models. 


Estimation of Population Totals 


Once the estimate of frame size is obtained then that 
estimate will often be combined with a sample mean to 
obtain an estimate of a population total ($\hat{Y} = \hat{N}\bar{y}$). The 
estimate of population total is subject to possible bias and 
additional variance because $N$ is estimated. The estimate 
may also be biased because $\bar{y}$ is not based on a random 
sample of the complete frame. 


More Complex Sampling Designs 


In this paper we have emphasized estimation of frame 
size in simple random sampling using the capture-recapture 
method. Further questions arise if more complex sampling 
designs are used. For example in stratified designs the 
question would arise of whether to estimate frame size 
in each stratum separately or to estimate the total frame 
size and then apportion it to the strata assuming equal 
probabilities of different strata on the incomplete lists. 
There is also the more complex question of how to esti- 
mate frame size in multi-stage sampling designs. This is 
obviously an area that needs future research. 
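
The two stratified options can be contrasted in a short sketch. The counts are hypothetical, and the apportionment rule (each stratum's share of the distinct units appearing on either list) is only one possible reading of the equal-probability assumption.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list frame size estimate from list sizes n1, n2 and m matches."""
    return n1 * n2 / m

# Hypothetical per-stratum counts: (list 1 size, list 2 size, matches).
strata = {"small": (120, 90, 18), "large": (80, 60, 24)}

# Option 1: estimate frame size separately in each stratum.
separate = {h: lincoln_petersen(*counts) for h, counts in strata.items()}

# Option 2: estimate the total frame size from pooled counts, then
# apportion it by each stratum's share of distinct listed units.
n1 = sum(c[0] for c in strata.values())
n2 = sum(c[1] for c in strata.values())
m = sum(c[2] for c in strata.values())
pooled = lincoln_petersen(n1, n2, m)

listed = {h: a + b - k for h, (a, b, k) in strata.items()}
total_listed = sum(listed.values())
apportioned = {h: pooled * listed[h] / total_listed for h in strata}

print("separate:   ", separate)
print("apportioned:", {h: round(v, 1) for h, v in apportioned.items()})
```

In this invented example the two options disagree noticeably because the match rates differ between strata, which is exactly the situation in which the equal-probability assumption behind apportionment is doubtful.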


Questionnaire Design for Business Surveys 


A.R. GOWER¹ 


ABSTRACT 


This paper provides an overview of important considerations that should be taken into account when developing 
and designing questionnaires for business surveys. These considerations include the determination of objectives 
and data requirements, consultation with data users and respondents, and methods for testing questionnaires. In 
developing and designing business survey questionnaires, focus groups and cognitive research methods help the 
researcher to identify potential sources of measurement error and to understand the response process that respondents 
go through in completing the questionnaires. Examples of focus groups and cognitive research undertaken by Statistics 
Canada are provided. 


KEY WORDS: Questionnaire testing; Focus groups; Cognitive research. 


1. INTRODUCTION 


There are many types of business survey questionnaires. 
Typically, a business survey questionnaire collects infor- 
mation about a company’s employees, its inventories, 
inputs, products, sales, and finances. It may also involve 
the collection of information related to market research 
or client satisfaction. 

Business surveys are conducted by mail or administered 
by an interviewer in person or over the telephone. Follow- 
ups to mail surveys are often conducted by telephone. 
New data collection technologies for business surveys 
involve computer-assisted interviewing, fax machines, 
touchtone self-response, and the electronic transmission 
of data. 

As in other types of surveys, questionnaires play a 
central role in the data collection process in a business 
survey. They have a major impact on data quality and 
on the image that a survey organization projects to its 
respondents. 

The purpose of this paper is to provide an overview of 
questionnaire design for business surveys. The paper 
discusses important considerations such as the determina- 
tion of objectives and data requirements, consultation 
with data users and respondents, the nature and concerns 
of business survey respondents, and methods for testing 
questionnaires. 

In developing and designing business survey question- 
naires, it is especially important to understand the response 
process that respondents go through in completing the 
questionnaires. Therefore, this paper emphasizes the 
effectiveness of using focus groups and cognitive research 
techniques to develop and test business survey question- 
naires. Examples of focus groups and cognitive research 
that have been carried out by the Questionnaire Design 
Resource Centre of Statistics Canada are provided. 


2. BUSINESS SURVEY QUESTIONNAIRES 


A well-designed questionnaire in a business survey should 
collect data efficiently, with a minimum number of errors. 
Moreover, questionnaires should facilitate the coding and 
capture of data. They should minimize the amount of editing 
and imputation that is required. They should also lead to 
an overall reduction in the cost and time associated with 
data collection and processing (Statistics Canada 1994). 

There are many considerations that apply to the devel- 
opment and design of business survey questionnaires. One 
key consideration is the nature of the respondent popula- 
tion. Business survey respondents answer in their role as 
employers or employees of a business. How a question- 
naire is completed depends on the position and level of 
responsibility that the respondent holds in the business 
organization or company. Therefore, it is critical to identify 
the most appropriate person to provide the information 
in a business survey. 

Response burden is a very real concern for business 
survey respondents. It depends on the number of questions 
that are asked, the time required to complete the question- 
naire, and the effort that respondents put into searching 
or manipulating other data sources to provide the infor- 
mation in the format requested. 

Businesses vary in size. Large businesses may have 
employees whose responsibilities include completing govern- 
ment and survey forms. In small businesses, respondents 
are often the owners or office managers who may not have 
as much time or flexibility in their schedules to complete 
the questionnaire. 

Information provided by respondents in business 
surveys typically involves the use of records or other infor- 
mation systems. Questionnaires often contain technical or 
professional terminology associated with providing finan- 
cial or administrative data. 


¹ A.R. Gower, Questionnaire Design Resource Centre, Statistics Canada, Ottawa, Ontario, K1A 0T6. 

Another consideration is the confidentiality and sensi- 
tivity of the information that the questionnaire is collecting. 
In many cases, businesses are concerned about providing 
confidential financial information that they do not want 
to reveal to competitors, governments or any other party. 
Therefore, assurances of confidentiality should be pro- 
vided. All necessary arrangements should be made for the 
proper handling and custody of data in order that the 
confidentiality of information is ensured. 


3. THE RESPONSE PROCESS IN BUSINESS 
SURVEYS 


The model of the response process is well known for 
household surveys. Answering survey questions 
involves comprehension, retrieval, thinking/judging, and 
responding (Tourangeau 1984). Respondents must first 
understand the question. They then search their memories 
to retrieve the requested information. After retrieving the 
information, they think about what the correct answer to 
the question might be and how much of that answer they 
are willing to reveal. Only then do they give an answer to 
the question. 

A corresponding response model for business surveys 
has also been developed (Edwards and Cantor 1991). 
Although the business survey model is similar to the 
household survey model, there are differences. The major 
difference is that business survey respondents must nor- 
mally access one or more external sources of information 
such as financial or administrative records. 

The ability of respondents to retrieve the requested 
information depends upon their familiarity with and 
understanding of the external source of information. They 
must also understand the relationship between the survey 
questions and the external data source. Multiple sources 
of information may add to the difficulty or complexity of 
this task. Further complexities may be introduced if the 
respondent has to consult another individual who can 
provide the requested information and who, in turn, 
may have to use one or more data sources (Gower and 
Nargundkar 1991). 


4. DEVELOPMENT AND TESTING OF BUSINESS 
SURVEY QUESTIONNAIRES 


There are several basic steps that are involved in devel- 
oping and testing business survey questionnaires. These 
steps are discussed below. 


4.1 Determination of the Objectives and Data 
Requirements 


A document should be prepared that provides a clear 
and comprehensive statement of the survey objectives, 
data requirements, and the data analysis plan. This docu- 
ment is a necessary step that leads to the determination of 
the variables to be measured, the survey questions, and the 
response alternatives. 

When designing the questionnaire, it is important to 
determine and understand the rationale for each question, 
how the information will be used, and whether the ques- 
tions will be good measures of what is required. 


4.2 Consultation with Clients, Data Users, Subject 
Matter Experts, and Respondents 


In formulating objectives and data requirements, 
consultation should take place with clients and data users 
to fully understand their requirements and expectations. 
Subject matter experts should be contacted for advice and 
guidance. 

If possible, the survey researcher should consult members 
of the survey population. This will help identify issues and 
concerns that are important to respondents, and may 
affect decisions regarding the content of the questionnaire. 
In addition, consultation with respondents will identify the 
language and terminology that respondents themselves use 
and will help clarify terminology, concepts and definitions. 


4.3 Previous Questionnaires 


Examining questions that were used in other surveys on 
the same or a similar topic provides a useful starting point 
in formulating the questions and response categories. In 
some situations (e.g., for comparing data over time), the 
same questions may be used. The researcher should ensure 
that the questions are phrased so as to provide valid, con- 
sistent, and effective measures of the variables of interest. 


4.4 The Use of Focus Groups in Developing 
Questionnaires 


A focus group is an informal discussion of a selected 
topic involving participants who are chosen from the 
survey population. It provides insights into the attitudes, 
opinions, concerns, and experiences of the participants. 
A focus group is led by a moderator who is knowledgeable 
about group interviewing techniques and the purpose of 
the discussion. 

Focus groups provide the opportunity to consult re- 
spondents, data users, and interviewers. In the early stages 
of developing a questionnaire, focus groups are used to 
develop the survey objectives and data requirements, to 
identify salient research issues, and to clarify definitions 
and concepts. 

Focus groups are also useful in testing and evaluating 
questionnaires (see 4.6 below). They are used to evaluate 
respondents’ understanding of the language and wording 
used in questions and instructions, and to evaluate alter- 
native question wordings and formats. 

Recruiting participants from businesses poses unique 
challenges for focus groups. Monetary incentives or hono- 
raria that are usually offered to focus group participants 
(currently in the order of $30 to $50 each) may not be 
appropriate for business people. Assurances of confiden- 
tiality and emphasis on the importance of the survey and 
their participation in the study are more meaningful. 
Another type of incentive that may be offered is a donation 
to a non-profit organization of the participant’s choice. 
Statistics Canada often gives focus group participants a 
copy of a publication that is of interest to them. 

Focus groups vary in size from 6 to 12 persons. The 
optimum size is 7 or 8 persons for business participants, 
although smaller groups with 4 or 5 people (called mini 
focus groups or mini groups) are sometimes held. Because 
of difficulties in finding participants from businesses, 
focus groups should be conducted at a time that is conve- 
nient to the participants. For business people, focus groups 
are often held during working hours. Focus groups are 
audio-recorded, and are viewed by observers in an 
adjoining room behind a one-way mirror. Participants are 
fully informed that audio-recording is taking place and 
that they are being observed. 


4.5 Considerations in Drafting the Questions 


Many considerations go into writing the questions and 
developing the response categories. It is important to keep 
in mind the objectives and data requirements as well as 
how the information will be collected and processed. The 
questions must relate to the information needs. They must 
be addressed to the right people in the organization or 
company. 

The method of data collection will determine how the 
questions and response categories will be formulated. The 
questions must be clearly worded and ordered 
in a logical sequence. They must be designed to 
be easily understood and accurately answered by respon- 
dents. Response categories and time reference periods 
should be compatible with the business’s record-keeping 
practices; however, this is often difficult to achieve. 

The layout of the questionnaire should be attractive. 
The questionnaire should be respondent-friendly and, if 
administered by an interviewer over the telephone or in 
person, it should be interviewer-friendly. 

The questionnaire should appear professional and 
‘‘business-like’’. When designing the questionnaire, it 
should be kept in mind that businesses are asked to com- 
plete many forms and questionnaires. Completing them 
is not a priority. Research conducted by Statistics Canada’s 
Questionnaire Design Resource Centre has shown that 
typical reactions from businesses to questionnaires are: 


• ‘‘I complete the shortest form first.’’ 
• ‘‘Is completion mandatory?’’ 
• ‘‘Is there a return deadline?’’ 


In one Statistics Canada study (Gower and Zylstra 
1990), a respondent commented that if the answer to these 
last two questions is ‘‘no,’’ then ‘‘I put [the question- 
naire] in my maybe I’ll get to it someday basket!’’ 

Respondents frequently question the value of informa- 
tion to themselves and to other users. Some like to receive 
feedback about the survey. Therefore: 


• Explain why it is important to complete the ques- 
tionnaire. 

• Ensure that the value of providing information is made 
clear to respondents. 

• Explain how the survey data will be used. 

• Explain how respondents can access the data. 


The instructions that go with the questionnaire also 
require attention. Research carried out by the Question- 
naire Design Resource Centre has repeatedly shown that 
respondents read only what they think is necessary to read. 
They read the boldface print first, and then decide whether 
they should read further. Respondents rarely read the 
instructions, and usually proceed directly to the questions. 
They refer to the instructions only when they think they 
need help. As a result, respondents may miss important 
instructions and definitions. Errors in reporting are often 
due to a lack of clear instructions and due to respondents 
not reading them or not understanding them (e.g., what 
to include or exclude). Therefore: 


• Ensure that instructions are short and clear. 

• Tell the respondent where to find the instructions. 

• Provide definitions at the beginning of the questionnaire 
or in specific questions as required. 

• Use boldface print or underlining to emphasize important 
items such as the reference or reporting period. 

• Specify ‘‘include’’ or ‘‘exclude’’ in the questions and 
items themselves (not in separate instructions). 


Other considerations that should be taken into account 
in designing business survey questionnaires include: 


• Consistency of terminology, questions and response 
categories with standard concepts and definitions. 

• Nature of the respondent population such as record- 
keeping practices and language ability. 

• Availability of the data. 

• Response burden. 

• Complexity of the data to be collected. 

• Comparability of results with other surveys. 

• Data reliability. 

• Nonresponse. 


The design of the questionnaire should also take into 
account any administrative requirements of the survey 
organization. For example, Statistics Canada’s policy on 
informing survey respondents (Statistics Canada 1986) 
requires that key information be explained to respondents. 
They must be informed about the main purpose(s) of the 
survey, the major intended uses of the data, the require- 
ment to respond (compulsory or voluntary), confiden- 
tiality protection, and any joint collection or data sharing 
agreements. At Statistics Canada there are also other 
administrative or legal requirements. For example, the 
Official Languages Act of Canada requires that question- 
naires be made available to respondents in both official 
languages (i.e., English and French). 


4.6 The Use of Cognitive Methods in Testing 
Questionnaires 


Questionnaire testing is essential to developing effective 
questionnaires that collect useful and accurate data. Cogni- 
tive research methods, sometimes referred to as qualitative 
testing, are especially useful in testing questionnaires. 

Cognitive methods provide the means to examine 
respondents’ thought processes as they answer the survey 
questions. They are used to ascertain whether or not 
respondents understand what questions mean and thus 
help assess the validity of questions and identify potential 
sources of measurement error. Cognitive methods also 
provide the opportunity to evaluate the questionnaire from 
the respondent’s point of view. They focus on issues such 
as comprehension and reactions to the form. This brings 
the respondent’s perspective directly into the questionnaire 
design process. The use of cognitive methods leads to the 
design of respondent-friendly questionnaires that can be 
completed easily and accurately. 

In business surveys, cognitive methods are used to 
investigate the relationship between the respondent and the 
external information source. They are also used to study 
the influence that this data source has on the response 
process. These methods provide the means to assess the 
compatibility of question wording, time reference periods, 
and response categories with the business’s record-keeping 
practices. 


Cognitive testing methods (Gower 1993) include: 


• In-depth interviews: The technique involves one-on-one 
interviews (sometimes called retrospective think-aloud 
interviews). For a mail questionnaire, respondents first 
complete the questionnaire as they normally would. An 
interviewer observes the process, noting the sequence in 
which the questions are answered, reference made to 
instructions, and the types of records or other persons 
consulted. The interviewer also notes the time required 
to complete sections, and corrections or changes made 
to responses. 


The interviewer then conducts the in-depth interview and 
obtains information about the respondent’s experiences 
and impressions in completing the form. The follow-up 
discussion typically involves a question-by-question 
review of the questionnaire with the respondent to 
discuss any problems or difficulties that were encountered 
while completing the form. The interviewer probes to 
see how the terms and concepts were interpreted by the 
respondents, how and why they chose the responses, and 
how information was recalled. 


For an interviewer-administered questionnaire, the ques- 
tions are first asked by an interviewer either in person 
or by telephone. The in-depth follow-up discussion takes 
place following this first interview. 


• Concurrent think-aloud interviews: These are also con- 
ducted one-on-one. The respondent is asked to ‘‘think 
aloud’’ while answering the questions, commenting on 
each question and explaining how the final response was 
chosen. The observer may probe the responses to get 
more information about a particular statement or to 
clarify the process through which a response was chosen. 


The success of the concurrent think-aloud interview 
technique depends on the respondent’s ability and will- 
ingness to articulate and express thoughts aloud. The 
observer may sometimes have to help the respondent in 
this task by gentle prompts such as: ‘‘what question are 
you answering now?’’, ‘‘what are you thinking now?’’, 
‘‘please explain how you chose the answer’’, or other 
probes to clarify the respondent’s thoughts. When a 
respondent is reluctant to verbalize thoughts, the 
observer may decide that the better approach is to 
handle the interview as an in-depth interview and proceed 
accordingly. 


Think-aloud interviews are very useful in obtaining 
respondents’ reactions to questionnaires. They are 
especially helpful in identifying areas of the question- 
naire where respondents have difficulty. They also help 
the researcher understand the process through which the 
questionnaire is completed. 


• Focus groups: As described in 4.4, focus groups are used 
to evaluate respondents’ understanding of the language 
and wording used in questions and instructions. The 
questionnaire is usually administered before the focus 
group session, in person, over the telephone or on a 
self-completion basis. 


During the focus group session, the moderator reviews 
the questionnaire with the participants and discusses any 
problems or difficulties that they may have encountered 
when completing the form. Focus groups stimulate and 
encourage thoughtful analysis of the questionnaire 
during group discussions of individual participants’ 
comments. They are especially useful in providing 
suggestions and recommendations for improvements. 


• Paraphrasing: Paraphrasing is used in one-on-one inter- 
views and focus groups. Respondents are asked to repeat 
the question in their own words, or to explain the 
meaning of terms and concepts that are used in the 
survey questions and instructions. 
Paraphrasing helps determine whether respondents read 
and understand the instructions and questions correctly. 
Paraphrasing is especially helpful in identifying question 
wording that is too complex or confusing. It also iden- 
tifies situations where respondents do not comprehend 
all the important components of the question (e.g., the 
reference period). 


4.7 Pretesting 


Pretesting is a fundamental step in developing a ques- 
tionnaire. It usually involves a small number of field 
interviews that are carried out to identify problems with 
a questionnaire. The entire questionnaire or only a portion 
of it may be tested. 

Pretests are useful for discovering poor question wording 
or ordering, errors in questionnaire layout or instructions, 
and problems caused by the respondent’s inability or un- 
willingness to answer the questions. Pretests are also used 
to suggest additional response categories that can be pre- 
coded on the questionnaire. Pretests provide a preliminary 
indication of the interview length and refusal problems. 

The pretest sample can range in size from 20 to 100 or 
more respondents. If the main purpose of the pretest is to 
discover wording or sequencing problems, only a small 
number of interviews may be required. More interviews 
(50 to 100) are necessary to determine pre-coded answer 
categories for open-ended responses. Respondents for 
pretests are usually selected purposively rather than 
randomly. 

The questionnaire for a pretest should be administered 
in the same way as planned for the main survey (e.g., 
interviewer-administered in person or by telephone). A 
pretest of a mail questionnaire is more effective if inter- 
viewers are used. Interviewers can be used to deliver the 
questionnaire and, later, to discuss any problems. The 
questionnaire designers should observe as many pretest 
interviews as possible. 

Pretesting is not as effective as cognitive methods in 
evaluating respondents’ understanding and the difficulty 
of the response task. Pretesting only indicates whether 
there is a problem. Without further investigation, it does 
not identify why there is a problem nor how it can be 
corrected. 


Debriefing sessions with interviewers often occur in 
conjunction with a pretest. Interviewers involved in a pretest 
can identify important problem areas where the question- 
naire can be improved. When existing questionnaires are 
redesigned, it is useful to consult interviewers to get their 
input into the redesign process. Interviewers have excellent 
insights into the logistics of administering the question- 
naire and how it affects respondent cooperation. 


Behavioral coding also can be conducted at the time of 
pretesting. The interview is audio-recorded, following 
which the interviewer and respondent behaviours during 
the interviewer-respondent interaction are coded and 
analyzed. Behavioral coding provides a systematic and 
objective means of examining the effectiveness of the 
questionnaire. It also helps to identify problem areas such 
as an interviewer failing to read the question as worded 
or a respondent asking for clarification of the question or 
response task. 
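
A minimal sketch of how coded behaviours might be tallied by question is given below. The code labels, the choice of which codes count as problems, and the records themselves are all hypothetical; an actual coding scheme would be defined by the survey team.

```python
from collections import Counter

# Hypothetical behaviour codes treated as signs of a problem question.
PROBLEM_CODES = {"misread", "clarification"}

# Invented records: one (question id, behaviour code) pair per coded exchange.
coded = [
    ("Q1", "exact"), ("Q1", "exact"), ("Q1", "clarification"),
    ("Q2", "misread"), ("Q2", "clarification"), ("Q2", "exact"),
    ("Q2", "misread"),
    ("Q3", "exact"),
]

def problem_rates(records):
    """Return, for each question, the share of coded exchanges flagged as problems."""
    totals = Counter()
    problems = Counter()
    for question, code in records:
        totals[question] += 1
        if code in PROBLEM_CODES:
            problems[question] += 1
    return {q: problems[q] / totals[q] for q in totals}

rates = problem_rates(coded)
# List questions from most to least problematic.
for question, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{question}: {rate:.0%} of exchanges flagged")
```

Questions with high rates would be candidates for follow-up with cognitive methods, since the tally shows where problems occur but not why.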


4.8 Formal Testing Methods 


Formal testing methods are quantitative in nature. 
They are designed to provide a statistical evaluation of 
how the questionnaire performs. Pilot studies and split 
sample testing are two commonly used types of formal 
testing methods. These methods are more suitable for large 
scale and continuing surveys because of the significant 
cost involved in implementing them and analyzing the 
results. 


A pilot study is conducted to observe how all the survey 
operations, including the administration of the question- 
naire, work together in practice. A pilot study is a ‘‘dress 
rehearsal’’. It duplicates the final survey design on a small 
scale from beginning to end, including data processing and 
analysis. It allows the survey researcher to see how well 
the questionnaire performs in relation to all other parts 
of the survey. There are some problems that can only be 
identified when all phases of the survey are tested together. 
For example, typographical errors and problems with 
question wording or concepts that need further clarification 
may be identified during interviewer training. The data 
processing phase may reveal keying problems with the 
precoded item numbers and/or answer categories 
(DeMaio 1983). 

Normally, the questionnaire should be thoroughly pre- 
tested before a pilot study takes place. A pilot study is 
usually not the time to try out new questions or approaches. 
If previous testing has been carried out, it is unlikely that 
the pilot study will result in major changes to the question- 
naire. The pilot study, however, does provide the oppor- 
tunity to fine-tune the questionnaire before its use in the 
main survey (DeMaio 1983). 


Split sample testing is conducted to determine the 
‘‘best’’ of two or more alternative versions of the ques- 
tionnaire. Split sample testing is also referred to as a ‘‘split 
ballot’’ or ‘‘split panel’’ experiment. It involves an exper- 
imental design that is incorporated into the data collection 
process. A split sample test can be designed to investigate 
issues such as question wording, question sequencing, the 
location of sensitive items, and data collection procedures. 
In a simple split sample design, half of the sample is 
selected at random and might receive one experimental 
treatment and half, the other. In a test that involves two 
experimental treatments, a 2 x 2 factorial design might 
be used with each of the two treatments in each experiment 
being tested on half of the sample (DeMaio 1983). 

A split sample design can also be used in continuing 
surveys that assess trends over time and compare results 
across surveys. In these types of surveys, there often is a 
concern that any change in the questionnaire or procedures 
may affect other data items besides the items being added 
or revised. In these cases, a split sample design may be used 
with a random sample of the respondents receiving the 
‘‘old’’ questionnaire and the rest, the ‘‘new’’ question- 
naire. Comparisons with earlier data can still be made by 
using the old questionnaire for most or part of the sample 
(DeMaio 1983). 
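
The random assignment behind a split sample test can be sketched as follows. The function names, treatment labels and sample size are hypothetical; the sketch simply shuffles the sample and deals units out evenly, first to two questionnaire versions and then to the four cells of a 2 x 2 factorial design.

```python
import random

def split_sample(unit_ids, versions=("A", "B"), seed=0):
    """Randomly assign each unit to one version, in (near) equal numbers."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    ids = list(unit_ids)
    rng.shuffle(ids)
    return {uid: versions[i % len(versions)] for i, uid in enumerate(ids)}

def factorial_2x2(unit_ids, seed=0):
    """Assign units to the four cells formed by two two-level treatments."""
    cells = [(wording, order)
             for wording in ("wording_1", "wording_2")
             for order in ("order_1", "order_2")]
    return split_sample(unit_ids, versions=cells, seed=seed)

sample = range(100)  # hypothetical sample of 100 businesses
simple = split_sample(sample, seed=1)
factorial = factorial_2x2(sample, seed=1)

print(sum(1 for v in simple.values() if v == "A"), "units receive version A")
print(sum(1 for w, _ in factorial.values() if w == "wording_1"),
      "units receive wording_1")
```

In the factorial assignment each level of each treatment is tested on half of the sample, so both experiments use the full sample rather than splitting it between them.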


4.9 Review and Revision of the Questionnaire 


The questionnaire should be reviewed by someone 
outside the project team. Reviewers could include subject 
matter experts or persons who have experience in designing 
questionnaires. A review can take place at any or all stages 
of the questionnaire development process, causing revisions 
in the questions and response categories. 

Questionnaire design is an iterative process. Throughout 
the whole process of questionnaire development, revision 
and testing, changes will be made continually to improve 
the questionnaire. Objectives and information require- 
ments are stated, evaluated and decided upon, data users 
and respondents are consulted, proposed questions are 
drafted and tested, questions are reviewed and revised, 
until a final questionnaire is developed. 


5. APPLICATION OF FOCUS GROUPS AND 
COGNITIVE RESEARCH METHODS 
TO TEST BUSINESS SURVEY 
QUESTIONNAIRES 


Statistics Canada has found that focus groups and 
cognitive research methods are very useful in developing 
and testing business survey questionnaires. These methods 
provide the opportunity to understand the cognitive 
processes involved in formulating responses to survey 
questions. They bring the respondent’s perspective directly 
into the questionnaire design process and lead to the design 
of respondent-friendly questionnaires (Gower and 
Nargundkar 1991). 

Statistics Canada’s applications of focus groups and 
cognitive research methods for business surveys include 
the developing and testing of questionnaires for the 
following surveys: 


• Survey of Employment, Payrolls and Hours (Bureau 
1991; Goss, Gilroy and Associates Ltd. 1989; Goss, 
Gilroy and Associates Ltd. 1990). 

• Census of the Construction Industry (Gower and Zylstra 
1990; Price Waterhouse Management Consultants 
1990). 

• Wholesale and Retail Trades Survey (Noonan 1992). 

• National Training Survey (Kennedy and de Groh Consul- 
tants 1992; D.R. Harley Consultants Limited 1993). 


These studies involved the application of one or more 
of the following methods: focus groups, in-depth inter- 
views, concurrent think-aloud interviews, and paraphrasing. 
All studies were carried out under the coordination and 
general direction of Statistics Canada’s Questionnaire 
Design Resource Centre (Gower 1991). 

Each of the studies has demonstrated the importance 
of and benefits to be gained from consulting with members 
of the target population before developing and finalizing 
the questionnaire. The studies have provided valuable 
insights into the response process and have identified 
various factors that contribute to measurement errors in 
business surveys. These factors include the respondents’ 
perceived value of the information, their perception of 
response burden, the compatibility of questions with their 
record-keeping practices, the placement and use of instruc- 
tions, the availability of data, and the complexity of the 
response task (Gower and Zylstra 1990). 

Highlights from two of the studies, the Census of the 
Construction Industry and the National Training Survey, 
are discussed below. 


5.1 Census of the Construction Industry 


The annual Census of the Construction Industry was 
designed to provide comprehensive statistics on the con- 
struction industry in Canada. The target population 
consisted of establishments whose main revenue was 
derived from construction activity. There were two separate 
questionnaires for (a) General Contractors and Developers 
and (b) Trade Contractors and Sub-Contractors. The 
questionnaires, which were mailed to respondents, collected 
data on revenues and costs, labour data, and output 
distributions. 

The questionnaires used in 1988 for the Census of the 
Construction Industry were redesigned for the 1989 survey. 
The main objectives of the revision were to reduce the 
content and response burden and to respond to the need 
for major improvements to the existing questionnaires. 

A pretest of the revised questionnaires took place to 
obtain the reactions of contractors (Statistics Canada 
1989). The pretest indicated that the revised forms were 
well received and understood by respondents. Some areas 
for further improvement such as changes to question 
wording and the clarification of certain instructions were 
identified. 

To learn more about how respondents would view the 
revised questionnaires and to ensure that response rates 
and data quality would be maximized, further testing of 
the questionnaires using focus groups and cognitive 
methods was carried out in early 1990. This phase of 
testing was designed to obtain in-depth information on the 
following issues: 
• How respondents felt about the questionnaires.

• The process that respondents went through to provide
the information.

• The layout, presentation, and readability of the
questionnaires.

• The extent to which respondents read and understood
instructions and questions.

• Problems encountered by respondents while completing
the questionnaires.

• Whether instructions and definitions were necessary,
understandable, and useful.

• The accuracy of information provided by respondents.

• The use of estimates by respondents and their accuracy.

• The types of records from which information was
obtained.

• The compatibility of the questions and response
categories with respondents’ record-keeping practices.

• Response burden in terms of time and effort.


The scope of the research included both the General
Contractors and Developers questionnaire and the Trade
Contractors and Sub-Contractors questionnaire. Approximately
50 construction firms participated in the study.
They were chosen to represent the types of respondents 
who completed the Census of the Construction Industry 
questionnaires. Twenty-five in-depth interviews, 16 con- 
current think-aloud interviews, and 2 focus groups were 
conducted in Ottawa, Montréal and Toronto. All one-on- 
one interviews took place at the respondent’s place of 
business. 

A very interesting finding from the study was that there 
were two distinct groups of respondents. The first group 
of respondents included the president or vice-president of 
a company, who often had to consult other individuals to 
complete certain questions. It took these participants 35 to 
45 minutes to complete the questionnaire. They were more 
likely to make estimates based on their familiarity with the 
company and were less concerned about accounting for 
differences between the questionnaire and the source of 
information used to complete the form. 

On the other hand, respondents such as office managers, 
accountants and comptrollers took 75 to 90 minutes to 
complete the questionnaire. These respondents were much 
more concerned with detail and providing accurate 
answers. They were more likely to use multiple sources of 
information and to make calculations in answering the 
survey questions (Gower and Zylstra 1990; Gower and 
Nargundkar 1991). 

Many respondents indicated that completing the ques- 
tionnaire was not a priority. They viewed the survey as 
only one of the many forms and questionnaires that they 
had to complete each year. Many participants indicated 
that they often waited for the follow-up telephone call, and
some even preferred to answer the questionnaire over the
telephone. They said that, over the telephone, they could
make estimates ‘‘off the tops of their heads’’ instead of
carefully completing the form, and this required much less 
time and effort on their part. 

The response burden was more perceived than real. 
Upon completing the questionnaire, many respondents 
remarked that it took surprisingly less time and was easier 
to complete than they had anticipated. 

A common theme that emerged during the interviews 
and focus groups was the perceived value of the informa- 
tion being collected. Respondents wanted to know the 
purpose of completing the questionnaire and often ques- 
tioned the value of the information to themselves and to 
other users of the information. Therefore, a major finding 
of the research was that the value of providing the infor- 
mation must be made clear to respondents. They wanted 
to know how the survey results were going to be used. 
They were also interested in learning how they could 
access the data. 

Overall, the questionnaires were very well received by 
respondents. They appreciated the ‘‘business-like’’ appear- 
ance and approach of the questionnaires. Many were 
familiar with completing previous questionnaires for the 
Census of the Construction Industry. They felt that the 
redesigned forms were an improvement over the previous 
versions because they seemed shorter and less complicated. 
This was positive feedback and reassurance for the survey 
managers who designed the new questionnaires (Gower 
and Zylstra 1990; Price Waterhouse Management 
Consultants 1990). 

The study identified many specific findings about how 
the questionnaires could be improved and made more 
‘‘respondent-friendly’’. While the pretest provided valuable 
feedback about response rates and the completeness of 
reporting, the focus groups and cognitive research added 
significantly to these findings by providing in-depth,
first-hand information about how and why respondents reacted
to the questions as well as about how and why responses
were chosen.

Figures 1 and 2 illustrate a few of the specific findings 
and how the questionnaire was improved based on these 
findings (Gower 1993). Figure 1 shows parts of Sections 2 
and 4 of the 1988 version of the questionnaire for General 
Contractors and Developers, before testing. Figure 2 
shows the corresponding parts of the final version of this 
questionnaire, after testing. 


Section 2 - Statement of Income 


On the final version of the questionnaire (Figure 2): 


• A statement is provided at the beginning of Section 2,
telling respondents that they could include their
company’s Financial Statements. On the version of the form
(Figure 1) that was tested, many respondents missed this
instruction because it appeared on a separate page of
instructions.
Figure 1 (before testing): 1988 Census of the Construction Industry (General Contractors and Developers), Statistics Canada

SECTION 2. STATEMENT OF INCOME                Dollars (Omit cents)

REVENUE

2.1 Revenue from construction contracts
2.2 Other operating revenue, please specify: Type
2.3 Total gross operating revenue (sum of items 2.1 and 2.2)
2.4 Accounting method is: [ ] completed contract   [ ] percentage of completion

DIRECT COST

2.5 Work in progress, opening (add, if required for direct cost calculation)

If direct cost detail is not available, please report percentages
of total (item 2.15, sum should equal 100).
Percentage

Sub-contracts
Materials and supplies used (adjusted for change in inventory)
Wages paid to hourly-rated employees (gross, before deductions for income tax,
pension plans, insurance, etc.)
Direct salaries paid to site supervisors, etc. (gross, before deductions for income tax,
pension plans, insurance, etc.)
Employee benefits (employer contributions not included in 2.8 and 2.9, such as
pension plans, insurance, etc.)
Cost includes (please check): [ ] undeveloped land   [ ] services, carrying charges, etc.   [ ] serviced lots
Repair and maintenance of machinery and equipment
Equipment rental (without operator)
Other direct cost
Total direct cost (sum of items 2.6 to 2.14)
Work in progress, closing (deduct if required for direct cost calculation)
Total direct cost charged to contracts (item 2.5 plus 2.15 minus 2.16)

SECTION 4. LABOUR FORCE

4.1 For wages paid to your hourly paid labour force, reported in item 2.8, please report hours worked:

201 ____ hrs.   or average hourly rate: 202 $ ____ / hour

N.B.: Reported figure should be hours worked, i.e. one hour overtime paid at time and a half should be counted as one hour.

4.2 For direct salaries paid, reported in item 2.9, please provide average annual number of employees:

203 ____ employees

For overhead salaries paid, reported in item 2.19, please provide average annual number of employees:

204 ____ employees
Figure 2 (after testing): 1989 Survey of the Construction Industry (General Contractors and Developers), Statistics Canada

SECTION 2. STATEMENT OF INCOME                (Omit cents)

Instead of completing this section, you may include your company’s Financial Statements, together with your otherwise completed
questionnaire. If financial statements are included, go directly to Section 3.

REVENUE

2.1 Revenue from construction contracts
2.2 Other operating revenue, such as sales of materials, land sales, project or construction management, rentals of
equipment and buildings, snow removal, consulting engineering fees. Please specify: Description
2.3 Total gross operating revenue (sum of items 202 and 207-210)
2.4 Please check accounting method used: [ ] complete contract   [ ] percentage of completion

DIRECT COSTS

2.5 Work in progress, opening (add, if required for direct cost calculation). Work in progress is defined as inventory of
uncompleted and unbilled construction work done

Only if direct costs detail is not available, please estimate percentages
of total direct costs (item 234, sum should equal 100)
Percentage

Sub-contracts (include equipment rental with operator)
Equipment rental without operator
Materials and supplies used (adjusted for change in inventory)
Wages paid to any hourly-rated employees (gross, before deductions for income tax,
pension plans, insurance, etc.)
Direct salaries charged to contract and paid to permanent staff, such as foremen,
site supervisors, etc. (gross, before deductions for income tax, pension plans,
insurance, etc.)
Employer portion of employee benefits, such as pension plans and insurance. (Report
only if employee benefits are not included in wages and direct salaries above)
Cost of land included in sales
Repair and maintenance of machinery and equipment
Depreciation charged to contracts
Other direct costs (any other direct costs not separately reported above, such as
pre-construction costs, site costs, fees, advertising, fuel, etc.)
Total direct cost (sum of items 224 to 233)                100
Work in progress, closing (deduct if required for direct cost calculation). For definition of work in progress see
question 2.5 above
Total direct costs charged to contract (item 213 plus 234 minus 235)

SECTION 4. LABOUR FORCE

4.1 Please report hours worked by your hourly paid labour force (whose wages were
reported in item 227):

401 ____ hours

N.B.: Reported figure should be hours worked, i.e. one hour overtime paid at time
and a half should be counted as one hour. Figures for hours worked may
be obtained from payroll records or Workers Compensation Board reports.

Only if hours worked are not available,
please report average (straight-time) hourly rate:

402 $ ____ / hour

4.2 Please report the average annual number of direct salaried employees
(whose salaries were reported in item 226):

403 ____ employees
Exclude owners and partners of unincorporated businesses

4.3 Please report the average annual number of overhead salaried employees
(whose salaries were reported in item 237):

404 ____ employees
Exclude owners and partners of unincorporated businesses

4.4 Number of professional engineers included in item 404:

405 ____ engineers


• Reference is made to line numbers (e.g., 202 and 207-210)
instead of item numbers (e.g., 2.1 and 2.2). Although
the line numbers are actually data code numbers,
respondents viewed them as line numbers because they appeared
similar to the common and well-known use of line numbers
on the Canadian Income Tax forms.

• Important information such as definitions and what to
include is provided in the items themselves instead of
on the Instructions page.

• Respondents are only required to report estimated
percentages if detail about direct costs is not available.
This choice has been made clearer by printing ‘‘or’’ in
large and bold print.


Note that, in completing Section 2, respondents con- 
sulted the following types of records: financial statements, 
on-line accounting systems, progress or work-on-hand 
billings, project reports, general ledgers, working papers, 
and audit statements. 


Section 4 - Labour Force 
On the final version of the questionnaire (Figure 2): 


• Question 4.1 includes information that ‘‘hours worked’’
may be obtained from ‘‘payroll records or Workers’
Compensation Board reports’’. During the think-aloud
interviews, respondents noted that they consulted these
types of records for the information.

• Clarification is provided that ‘‘average hourly rate’’ is
to be reported ‘‘only if hours worked are not available’’.

• Important information and instructions are included in
the question items. For example, during testing, most
respondents did not exclude owners and partners in
reporting the numbers of employees in items 4.2 and 4.3
(even though this was specified on the Instructions page).


5.2 National Training Survey (NTS) 


Two separate research studies, each involving the appli- 
cation of focus groups and cognitive research methods, 
have been used during the development and testing of the 
questionnaire for the National Training Survey (NTS). 

The purpose of the NTS is to collect information on 
employee training and development in the private business 
sector. Respondents are asked to provide data on the type 
and volume of training, the number of trainees and their 
occupational groupings, the characteristics of the busi- 
nesses providing training to their employees, and the 
amount of money being spent on this activity. In large 
businesses, respondents are the persons involved in the 
human resource planning and training areas of their 
company, while in smaller businesses they are typically the 
owner or chief executive officer. 

At an early stage in developing the questionnaire, focus 
groups and in-depth interviews were held with represen- 
tatives from small, medium and large companies. These 
methods were used because Statistics Canada felt it was 
important to consult representatives of the business
community to ensure that their interests and concerns 
about training were considered in the design of the NTS 
questionnaire. 

The focus groups and interviews evaluated the clarity 
and appropriateness of terminology and concepts associated 
with the training of employees within a business establish- 
ment. The study investigated respondents’ understanding 
of terms such as ‘‘formal training’’ and ‘‘informal training’’
as well as their ability to use these terms to categorize their 
training activities. 

Findings from this early phase of testing illustrated the 
importance of consulting with respondents before finalizing 
the terminology and concepts used in questionnaires. The 
findings from the study provided the survey project team 
with important information and insights into how the 
survey questions should be worded and how response 
options should be categorized. 

For example, a significant finding from the focus 
groups and in-depth interviews was that many companies 
did not use the terms ‘‘formal’’ or ‘‘informal’’ to describe 
training activities and did not see the advantage or need 
to differentiate between the two terms. Many also perceived 
that there was no clear distinction between the terms 
‘‘formal’’ and ‘‘informal’’ that would enable easy
categorization of training activities.

The study helped the survey designers understand how 
respondents interpret terms and concepts. Participants 
provided suggestions on the appropriate terminology for 
them. For example, although they had difficulties with the 
terms ‘‘formal’’ and ‘‘informal,’’ participants were able 
to provide characteristics to define these terms. They 
described formal training as having ‘‘a formal structured 
curriculum or course outline with a beginning, middle and 
an end; that it has known objectives or clearly defined 
goals; that it has an evaluation component; .... [and] that 
[it] has a dollar cost.’’ On the other hand, most
participants perceived ‘‘informal training’’ to be on-the-job
training having no structure, often involving learning by 
observing. ‘‘Lack of evaluation’’ was another characteristic 
often suggested to define informal training. 

Another interesting finding was that many participants 
made a distinction between ‘‘training’’ and ‘‘developmental 
or educational activities’’. The term ‘‘training’’ was not 
seen to cover all the activities that employers provide to 
support employee development. Some participants viewed 
‘‘training’’ as job-specific and related to job productivity,
and ‘‘development’’ as related to increasing the knowledge
base of the individual (Kennedy and de Groh 1992). 

After the draft NTS questionnaire was developed, it 
was tested using focus groups and concurrent think-aloud 
interviews. Representatives of a variety of businesses as 
well as a mixture of small, medium and large firms par- 
ticipated in the study. The study examined the following 
issues: 
• The most appropriate person within a business to respond
to the survey.

• How best to reach respondents.

• The process that respondents went through to provide
the information.

• The way in which respondents understood the questions
and instructions.

• Respondents’ reaction to vocabulary and the groupings
and classifications of occupations in the survey.

• Whether the information sought in the survey was readily
available.

• The types of records from which information was
obtained.

• The compatibility of the questions and response
categories with respondents’ record-keeping practices.

• Whether the reference periods requested in the survey
corresponded to the record-keeping practices of
respondents.

• Response burden in terms of time and effort.


Seven focus groups and 26 interviews were conducted 
in Ottawa, Toronto, Montréal, and Vancouver. In the 
final report (D.R. Harley Consultants Limited 1993), the 
Contractor reported many findings and made several 
recommendations to improve the questionnaire. 

As in other studies of business surveys, a major finding 
was that many participants questioned the purpose behind 
the survey. They wanted to know why the information was 
being collected and how the survey results were going to 
be used. A strong theme that emerged throughout the 
focus groups and interviews was that respondents wanted 
to know ‘‘What’s in this for me?’’

Some participants suggested that the data be aggregated 
nationally, provincially and by sector so that they could 
compare themselves to other companies in their areas of 
business and in their part of the country. As one respondent 
said, ‘‘I would want the data to be specific to our industry 
with the volume and type of training that’s being provided 
.... It should allow us to compare ourselves to others in 
our sector - number of employees being trained and the 
percentage of payroll being spent on employee training.’’

Many small and medium-sized business respondents 
found the questionnaire too broad and the level of detail 
too complicated for them to answer. In their opinion, the 
questionnaire was designed for larger organizations. For 
example, many small businesses felt that they could not 
fit themselves into the categories provided by the question- 
naire. They felt that much of their training fell into the 
‘‘unstructured’’ category, and that the questionnaire was 
not capturing this aspect of training. However, at the same 
time, there were other respondents from small and
medium-sized businesses who commented that the questionnaire was
thorough and complete. 

The larger businesses also had difficulty with the level 
of detail being requested by the survey. The major problem 
was that they keep training records by type of training that
employees receive rather than by the occupational category 
of the people being trained. 

Overall, a variety of record-keeping practices were 
observed. Some businesses keep excellent records on 
training, while others do not. Participants who did not
keep good records, or whose records did not contain the
requested information, found the questionnaire difficult
to answer. Others, who had sophisticated records, could
manipulate their data to fit the questionnaire. The one
exception was the questions on training expenditure, for
which they found it difficult to provide detailed
information. Global figures were more easily available, they said.
Many businesses indicated that their training records were
not centralized, which made the questionnaire more
difficult and more time-consuming to complete. They said
that they would complete what they could, and then coor- 
dinate the completion of the rest of the questionnaire by 
forwarding it to many parts of their organization. 

Although many participants were initially overwhelmed 
by the size and apparent complexity of the questionnaire, 
they found it easier to complete than expected. Many 
found that the thoroughness of the questionnaire actually 
made them remember many training activities that they 
would not ordinarily have reported on. 

Most participants felt that the questionnaire should be 
shorter. But they also suggested adding a few more open- 
ended questions about future training. In terms of response 
burden, respondents (especially in medium-sized and
large companies) found that the questions about
training expenses, training hours, and the numbers of 
employees trained by occupational categories would 
require hours of work to compile. 

Differences were found in the time it took respondents 
to complete the questionnaire. Small businesses took 
between 10 minutes and 1 hour to complete the
questionnaire. Large businesses, on the other hand, estimated that
it would take about 2 hours to complete the questionnaire 
(D.R. Harley Consultants Limited 1993). 


6. CONCLUDING REMARKS 


This paper has provided an overview of questionnaire 
design for business surveys. As the paper has pointed out, 
many considerations go into designing business survey 
questionnaires. They include the survey’s objectives and 
data requirements as well as consultation with data users 
and respondents on the nature and concerns of the respon- 
dent population. Other considerations are response 
burden, the method of data collection, the availability of 
data, and the use of records, as well as the need for testing 
the questionnaires. 

Specific design issues that should be taken into account 
include the instructions, the clarity and readability of the 
questions, the logical sequencing of the questions, the
compatibility of response categories and reference periods 
with respondents’ record-keeping practices, and data 
processing requirements. The questionnaire should be 
respondent-friendly and interviewer-friendly. 

To ensure the collection of accurate and useful data in 
business surveys, it is important to understand the response 
process that respondents go through in completing a 
questionnaire. Focus groups and cognitive research 
methods are very effective ways to study this response 
process and to test questionnaires. They provide the 
opportunity to consult directly with respondents and, 
thereby, to bring their ideas, concerns, and suggestions 
into the questionnaire design process. 

Looking towards the future, research and experience 
should lead to improvements in the methods and approaches 
that are currently used to develop and test business survey 
questionnaires. An important area that requires more 
research and development is the relationship among the 
questionnaire, the respondent, and the external informa- 
tion source as well as the influence that this relationship 
has on the response process and the accuracy of reporting. 
Bias Corrections for Survey Estimates from Data with Ratio
Imputed Values for Confounded Nonresponse 


E. RANCOURT, H. LEE and C.-E. SÄRNDAL¹


ABSTRACT 


Most surveys suffer from the problem of missing data caused by nonresponse. To deal with this problem, imputation 
is often used to create a ‘‘completed data set’’, that is, a data set composed of actual observations (for the respondents) 
and imputations (for the nonrespondents). Usually, imputation is carried out under the assumption of an unconfounded
response mechanism. When this assumption does not hold, a bias is introduced in the standard estimator of the 
population mean calculated from the completed data set. In this paper, we pursue the idea of using simple correction 
factors for the bias problem in the case that ratio imputation is used. The effectiveness of the correction factors 
is studied by Monte Carlo simulation using artificially generated data sets representing various super-populations, 
nonresponse rates, nonresponse mechanisms, and correlations between the variable of interest and the auxiliary 
variable. These correction factors are found to be effective especially when the population follows the model 
underlying ratio imputation. An option for estimating the variance of the corrected point estimates is also discussed. 


KEY WORDS: Conditional bias; Monte Carlo simulation; Restoring estimator; Variance estimation. 


1. INTRODUCTION 


Nonresponse is the norm rather than the
exception in surveys. Missing data caused by nonresponse
are often imputed to obtain a completed data set and the 
standard estimator is applied to the completed data set 
assuming that the underlying response mechanism is 
unconfounded. However, a point estimate obtained in 
such a way is biased when the response mechanism is 
confounded. The bias in this case could be very severe as 
pointed out in Lee, Rancourt and Särndal (1994). A
response mechanism is unconfounded, according to Rubin 
(1987, p. 39), if it does not depend on the variable under 
study, otherwise it is confounded. (A formal definition 
suitable for this paper will be given in Section 2.) 

In a Bayesian framework, a concept similar to that of 
an unconfounded response mechanism is termed ignorable. 
For bias caused by a nonignorable response mechanism, 
Rubin (1977, 1987) and Little and Rubin (1987) considered 
a method to correct the respondent mean using auxiliary 
variables. In this approach, a linear regression is assumed 
between the variable of interest y and a vector of auxiliary 
variables x. The regression coefficient vector for the 
nonrespondents is assumed to have a normal prior with 
mean equal to the regression coefficient vector for the 
respondents. 

Assuming a logistic model for the response probability,
Greenlees, Reece and Zieschang (1982) proposed a method
to deal with nonignorable nonresponse using maximum
likelihood estimation. Further, a linear regression model
is assumed for the relationship between y and x, a vector
of auxiliary variables. The logistic model of the response
probability includes y and z, a vector of other auxiliary
variables. Assuming also that the error term of the
regression is normally distributed, they obtain maximum
likelihood estimates of the unknown parameters of the regression
model and the logistic model. Finally, for a nonrespondent,
an imputed value is calculated as the mean of the
distribution of y conditional on the values of x and z for the
nonrespondent and on the estimated parameters. Such a
method may give good results when all the model
assumptions are satisfied but is likely to be highly sensitive to the
specifications of the two models. The adequacy of the
response probability model is usually untestable. If data
are available from an external source, however, then it
may be possible to test the response probability model as
Greenlees et al. did in their application to the Current
Population Survey data. This method is highly
computer-intensive.

In the case of categorical data, a few methods have also 
been proposed to deal with the problem of nonignorable 
nonresponse. For instance, Baker and Laird (1988) try 
to model the response mechanism with the help of log- 
linear models. As well, causal modeling is discussed in Fay 
(1986, 1989). 

Ratio imputation is often used at Statistics Canada, 
especially in repeated surveys. For instance, in the Monthly 
Survey of Manufacturing, a missing value of the current
shipment is imputed by ratio imputation using the previous
month's shipment as the auxiliary variable value. This
simple method is very appealing to subject matter 
specialists because it reflects month-to-month movement. 


¹ E. Rancourt and H. Lee, Business Survey Methods Division, Statistics Canada, Ottawa, Ontario, Canada, K1A 0T6; C.-E. Särndal, Département
de mathématiques et de statistique, Université de Montréal, C.P. 6128, succursale A, Montréal (Québec), Canada, H3C 3J7.

In this paper, we investigate the possibility of improving
the estimator applied to data containing ratio imputation 
with the aid of simple correction factors. Therefore, we 
assume that imputation has already been performed, and 
try to correct the estimator. We focus our attention on the 
estimation of the mean. The use of simple correction 
factors would be very appealing to the user provided it 
works reasonably well. Such a procedure is also easy to 
implement without resorting to excessive computational 
efforts and it enables us to avoid explicit modeling of the 
nonresponse mechanism. However, our approach differs 
from Rubin’s in that we use sample dependent correction 
factors rather than an a priori chosen constant. 

In Section 2, we define several simple correction factors 
that meet our requirements. In Section 3, we propose a 
variance estimator that may be used in conjunction with 
the corrected point estimators. The properties of the 
corrected point estimators were examined by a Monte 
Carlo simulation reported in Sections 4 and 5. Section 6 
presents some concluding remarks. 


2. SIMPLE BIAS CORRECTION FACTORS 


Let $U = \{1, \ldots, k, \ldots, N\}$ denote the index set of a
finite population and let the population mean of the
variable of interest $y$ be denoted by $\bar{y}_U = (1/N) \sum_U y_k$.
We assume that $y_k > 0$ for all $k \in U$. From $U$, a simple
random sample $s$ of size $n$ is drawn without replacement
(SRSWOR). The unbiased estimator that would be used
with 100% response is the sample mean

$$\bar{y}_s = (1/n) \sum_s y_k. \qquad (2.1)$$


Let $r$ and $o$ be the sets of the responding and
nonresponding units, respectively, so that $s = r \cup o$. We denote
the SRSWOR sampling plan by $p(\cdot)$ and the response
mechanism given $s$ by $q(\cdot \mid s)$. That is, $p(s)$ is the probability
that the SRSWOR sample $s$ is drawn, and $q(r \mid s)$ is the
probability that the set $r$ responds given the sample $s$. Let
also $m$ and $l$ be the sizes of $r$ and $o$, respectively. For
simplicity, we assume that the probability of $m = 0$ is
negligible. We assume that imputation is carried out with
the aid of an auxiliary variable, $x$, whose value, $x_k$, is
known and positive for all $k \in s$. If $k \in o$, the missing value
$y_k$ is imputed by $\hat{y}_k$. The completed data set is denoted as
$\{y_{\bullet k} : k \in s\}$ where $y_{\bullet k} = y_k$ if $k \in r$ and $y_{\bullet k} = \hat{y}_k$ if $k \in o$.

In this paper, we examine ratio imputation. This often-used
imputation method is based on a simple model. That
is, if the value $y_k$ is missing, it is imputed by $\hat{B}_r x_k$, where
$\hat{B}_r = (\sum_r y_k)/(\sum_r x_k)$. The model, denoted $\xi$, states
that, for $k \in s$,

$$y_k = B x_k + \epsilon_k, \quad E_\xi(\epsilon_k \mid x_k) = 0, \quad V_\xi(\epsilon_k \mid x_k) = \sigma^2 x_k,$$

$$E_\xi(\epsilon_k \epsilon_l \mid x_k, x_l) = 0, \quad k \neq l. \qquad (2.2)$$


Under this model, $\hat{B}_r x_k$ is the best linear unbiased predictor
of the missing value $y_k$, based on the respondent data
$\{(y_k, x_k) : k \in r\}$. The completed data set is then composed
of the values

$$y_{\bullet k} = \begin{cases} y_k & \text{if } k \in r, \\ \hat{B}_r x_k & \text{if } k \in o. \end{cases} \qquad (2.3)$$


The customary procedure is to apply the estimator 
formula used for 100% response to the completed data set. 
This gives 
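
$$\bar{y}_{\bullet s} = \frac{1}{n} \sum_s y_{\bullet k} = \bar{x}_s \, \frac{\bar{y}_r}{\bar{x}_r} = \bar{y}_{raimp}, \qquad (2.4)$$

where $\bar{x}_s = (1/n) \sum_s x_k$, $\bar{y}_r = (1/m) \sum_r y_k$ and $\bar{x}_r = (1/m) \sum_r x_k$.
Note that raimp stands for ratio imputed.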
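
The identity in (2.4) can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `ratio_impute` and the toy data are hypothetical, and missing $y$-values are assumed to be marked with `None`.

```python
def ratio_impute(x, y):
    """Ratio imputation for one sample s.

    x: positive auxiliary values x_k for all k in s.
    y: y_k for respondents, None for nonrespondents (assumed encoding).
    Returns the completed data set and the slope estimate B_r.
    """
    resp = [k for k, yk in enumerate(y) if yk is not None]
    if not resp:
        raise ValueError("no respondents (m = 0)")
    b_r = sum(y[k] for k in resp) / sum(x[k] for k in resp)
    completed = [yk if yk is not None else b_r * xk for xk, yk in zip(x, y)]
    return completed, b_r


# Hypothetical toy sample with n = 4 and m = 2.
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, None, 33.0, None]

completed, b_r = ratio_impute(x, y)
y_raimp = sum(completed) / len(completed)       # mean of the completed data

# Closed form: x_bar_s * y_bar_r / x_bar_r.
resp = [k for k, yk in enumerate(y) if yk is not None]
x_bar_s = sum(x) / len(x)
y_bar_r = sum(y[k] for k in resp) / len(resp)
x_bar_r = sum(x[k] for k in resp) / len(resp)
closed_form = x_bar_s * y_bar_r / x_bar_r
```

Both routes give the same value, which illustrates that ratio imputation followed by the ordinary sample mean is the ratio estimator based on the respondents.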

It now becomes necessary to address the question
whether the imputation can restore the full response estimator,
$\bar{y}_s$, in the sense that the imputation estimator $\bar{y}_{\bullet s}$
is equal to $\bar{y}_s$ in expectation given $s$. Unless this can be
achieved, the ratio imputation will have introduced bias.
To examine this question, we must consider the response
mechanism. A response mechanism $q(\cdot \mid s)$ is said to be
unconfounded for the purpose of this paper if it is of the
form $q(r \mid s) = q(r \mid \mathbf{x}_s)$, where $\mathbf{x}_s = \{x_k : k \in s\}$, and
the response probabilities satisfy $P(k \in r \mid s) > 0$ for all
$k \in s$. That is, it may depend on $s$ and on the associated
$x$-values. If it depends also on the $y$-values, so that
$q(r \mid s) = q(r \mid \mathbf{x}_s, \mathbf{y}_s)$, then it is called confounded. In
these definitions, the response mechanism is conditional
on the realized sample $s$. Slightly different definitions of
"confounded" and "unconfounded" are given in Rubin
(1987, p. 39) where they are unconditional.


An example of an unconfounded response mechanism is

$$q(r \mid s) = \prod_{k \in r} (1 - \theta_k) \prod_{k \in s - r} \theta_k,$$

where $\theta_k = 1 - P(k \in r \mid s) = 1 - e^{-\gamma x_k}$, for some
positive constant $\gamma$, is the nonresponse probability of unit
$k$. By contrast, if $\theta_k = 1 - e^{-\gamma y_k}$, then $q(r \mid s)$ is a
confounded mechanism.

A particularly simple unconfounded mechanism is the
uniform response mechanism defined by $q(r \mid s) =
(1 - \theta)^m \theta^{n-m}$. Here, units respond according to independent
and identical Bernoulli$(1 - \theta)$ trials, where $\theta$
is the nonresponse probability common to all units.

Whether an imputation estimator $\hat{\bar{y}}_U$ of $\bar{y}_U$, including
$\bar{y}_{raimp}$ given by (2.4), is considered good depends in part
on the assumptions made by the analyst about the response
mechanism and in part on the relation between $y$ and $x$.
Several possible assumptions are discussed later in this
section. For any given $s$, the goal is that, under specified
realistic assumptions, the expectation of the difference
$\hat{\bar{y}}_U - \bar{y}_s$ should be close to zero. That is, under the given
assumptions, the conditional bias of $\hat{\bar{y}}_U$, C-bias$(\hat{\bar{y}}_U) =
E(\hat{\bar{y}}_U - \bar{y}_s \mid s)$, should be small. We call $\hat{\bar{y}}_U$ a restoring
estimator of $\bar{y}_U$ if C-bias$(\hat{\bar{y}}_U) = 0$ or $\approx 0$, that is, if $\hat{\bar{y}}_U$
is (approximately) equal to $\bar{y}_s$ in conditional expectation.
It follows that if the C-bias is (approximately) zero for any
$s$, then the unconditional bias over all sample realizations
$s$ is also (approximately) zero.

Different analysts make different assumptions. Let us 
consider some typical assumptions and ask the question: 
What restoring estimators do these assumptions allow? 


Assumption I: The response mechanism is uniform. 


Under Assumption I, $\bar{y}_{raimp}$ is a restoring estimator. To
see this, note that

$$\text{C-bias}(\bar{y}_{raimp}) = E_q(\bar{y}_{raimp} \mid s) - \bar{y}_s \approx 0,$$

because, given $s$, $\bar{y}_{raimp}$ is the classical ratio estimator of
$\bar{y}_s$. Assumption I is unrealistic in most surveys. The
response propensity is known to vary with observable
characteristics such as size and industry (for business
establishments), family size and type (for households), and
age, sex and income (for individuals). Under this unrealistic
assumption, even a naive estimator such as the respondent
mean, $\bar{y}_r = (1/m) \sum_r y_k$, is a restoring estimator:

$$\text{C-bias}(\bar{y}_r) = E_q(\bar{y}_r \mid s) - \bar{y}_s = 0.$$

However, if Assumption I holds, $\bar{y}_{raimp}$ is preferred to $\bar{y}_r$
because the ratio estimator feature leads to a smaller
variance if the model $\xi$ holds.

The analyst clearly needs to consider more realistic 
assumptions which allow the response probabilities to vary 
with background variables. The following assumption, 
composed of two parts, is of this kind. 


Assumption II: (II-1): the response mechanism is uncon- 
founded but otherwise arbitrary; 


(II-2): the ratio model (2.2) holds. 


Here (II-1) is a weaker and more realistic requirement
on the response mechanism than the uniformity requirement
in Assumption I. Under (II-1), the response mechanism
can be of any form as long as it is unconfounded. However,
Assumptions I and II are not directly comparable
since II contains a model component, (II-2), which is
lacking in I. Under Assumption II, $\bar{y}_{raimp}$ is a restoring
estimator because

$$\text{C-bias}(\bar{y}_{raimp}) = E_\xi\{E_q(\bar{y}_{raimp} - \bar{y}_s \mid s)\}$$

$$= E_q\left\{ \bar{x}_s E_\xi\left(\frac{\bar{y}_r}{\bar{x}_r}\right) - E_\xi(\bar{y}_s) \,\Big|\, s \right\}$$

$$= E_q(B \bar{x}_s \mid s) - B \bar{x}_s = 0.$$

Note that changing the order of the expectations, $E_\xi E_q$ to
$E_q E_\xi$, is allowed under Assumption II, because the
response mechanism is then of the form $q(r \mid \mathbf{x}_s)$, that is,
it does not depend on the $y$-values. By contrast, the
respondent mean $\bar{y}_r$ is not a restoring estimator because

$$\text{C-bias}(\bar{y}_r) = E_\xi\{E_q(\bar{y}_r - \bar{y}_s \mid s)\} = B\{E_q(\bar{x}_r \mid s) - \bar{x}_s\},$$

which is generally nonzero under Assumption II. We can,
however, transform $\bar{y}_r$ into a restoring estimator by the
use of a multiplicative correction factor. This leads to

$$\bar{y}_r \left(\frac{\bar{x}_s}{\bar{x}_r}\right), \qquad (2.5)$$

which is just another way of writing $\bar{y}_{raimp}$, as can easily
be verified. In an example using the Bayesian approach,
Little and Rubin (1987, p. 233) arrive at an estimator identical
to the estimator (2.5).

Let us now consider confounded response mechanisms. 
They cause more difficult problems for finding a restoring 
estimator. 


Assumption III: (III-1): the response mechanism is con- 
founded but otherwise arbitrary; 


(III-2): the ratio model (2.2) holds. 


It is usually difficult, if not impossible, for the analyst 
to decide whether Assumption II or Assumption III is 
more appropriate. Examining the data will not be of much 
help if the only data available relate to the present point 
in time, as would typically be the case in a one-time survey. 
The assumption made (whether II or III) is then unveri- 
fiable. By contrast, if the analyst has experience with a 
regularly repeated survey, he or she may have legitimate 
reasons to believe, for example, that the nonresponse is 
a function of the variable of interest. 

In some situations, the assumption of a confounded
mechanism may be made on the following grounds. Suppose
in a survey of personal finances that $y$, the variable
under study, is "savings" and that $x$, the auxiliary variable,
is "income", with values $x_k$ known for the individuals
$k \in s$. The nonresponse probability of respondent $k$ is
likely to be correlated with the savings figure $y_k$ that he
or she is asked to reveal as well as with the income figure
$x_k$ known from other sources. But since savings, not
income, is the variable with which the respondent is
directly confronted in the survey, the assumption that the
nonresponse probability is a function of $y_k$ may be more
realistic than the assumption that it is a function of $x_k$.
Hence a confounded mechanism may be more realistic to
assume than an unconfounded mechanism.

Under Assumption III, neither $\bar{y}_r$ nor $\bar{y}_{raimp}$ is a restoring
estimator. The C-bias of $\bar{y}_{raimp}$ can be expressed as

$$\text{C-bias}(\bar{y}_{raimp}) = \bar{x}_s E_\xi E_q\left( \frac{\sum_r \epsilon_k}{\sum_r x_k} \,\Big|\, s \right),$$

where $\epsilon_k$ is defined by the model (2.2). This C-bias is
generally nonzero and can be quite large when the nonresponse
rate is high and the correlation between $x$ and $y$ is weak.
However, the C-bias is hard to evaluate, since the exact
form of the response mechanism is left unspecified. Note
that changing the order of the expectations $E_\xi$ and $E_q$ is
not permitted under Assumption III since $q(r \mid s)$ depends
on the $y$-values. For example, a negative C-bias is likely
to occur if the respondent residual total, $\sum_r \epsilon_k$, tends to be
negative.
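
As a sketch of why the respondent residual total drives the C-bias, one can substitute the model (2.2), $y_k = B x_k + \epsilon_k$, into $\bar{y}_{raimp} = \bar{x}_s (\sum_r y_k)/(\sum_r x_k)$ and into $\bar{y}_s$. The steps below use only quantities already defined in this section:

```latex
\begin{aligned}
\bar{y}_{raimp} - \bar{y}_s
  &= \bar{x}_s \, \frac{\sum_r (B x_k + \epsilon_k)}{\sum_r x_k}
     - \frac{1}{n} \sum_s (B x_k + \epsilon_k) \\
  &= B \bar{x}_s + \bar{x}_s \, \frac{\sum_r \epsilon_k}{\sum_r x_k}
     - B \bar{x}_s - \frac{1}{n} \sum_s \epsilon_k \\
  &= \bar{x}_s \, \frac{\sum_r \epsilon_k}{\sum_r x_k}
     - \frac{1}{n} \sum_s \epsilon_k .
\end{aligned}
```

The last term does not involve $r$ and has model expectation zero, so only the first term contributes to the conditional bias. When response depends on $y_k$, and hence on $\epsilon_k$, the respondents' residuals no longer average to zero, and the first term does not vanish in expectation.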

A confounded response mechanism (as in Assumption
III) introduces bias in the slope estimator $\hat{B}_r = (\sum_r y_k)/
(\sum_r x_k)$. Consequently, $\hat{B}_r x_k$ is a biased imputation for a
missing value $y_k$. To improve the situation, suppose that
a missing value $y_k$ is imputed by $C \hat{B}_r x_k$ instead of $\hat{B}_r x_k$,
where $C$ is a quantity to be specified. Then the data after
imputation are given by

$$y^{C}_{\bullet k} = \begin{cases} y_k & \text{if } k \in r, \\ C \hat{B}_r x_k & \text{if } k \in o, \end{cases} \qquad (2.6)$$

and denoting the sample mean of these data as $\bar{y}_{C \bullet s} =
(1/n) \sum_s y^{C}_{\bullet k}$, we get the estimator

$$\bar{y}_{C \bullet s} = \bar{y}_r \left[ 1 + \left( 1 - \frac{m}{n} \right) \left( C \, \frac{\bar{x}_o}{\bar{x}_r} - 1 \right) \right], \qquad (2.7)$$

where $\bar{x}_o = (1/l) \sum_o x_k$ is the mean of the $x$-values of the nonrespondents.

A simple correction of the type used in (2.6) was mentioned
in Rubin (1986; 1987, p. 203) in the context of multiple
imputation. Rubin views $C$ as a fixed constant chosen by
the user according to his or her prior knowledge. If such
a choice happens to be well founded, the bias of (2.7) may
be small.
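
The corrected imputation can be illustrated with a short sketch. It is not taken from the paper: the function names and toy data are hypothetical, nonrespondents are assumed to be marked with `None`, and the closed form used is the one obtained by writing out the mean of the completed data, $\bar{y}_r [1 + (1 - m/n)(C \bar{x}_o / \bar{x}_r - 1)]$.

```python
def corrected_mean_direct(x, y, c):
    """Impute C * B_r * x_k for nonrespondents, then take the sample mean."""
    resp = [k for k, yk in enumerate(y) if yk is not None]
    b_r = sum(y[k] for k in resp) / sum(x[k] for k in resp)
    completed = [yk if yk is not None else c * b_r * xk for xk, yk in zip(x, y)]
    return sum(completed) / len(completed)


def corrected_mean_closed_form(x, y, c):
    """Same estimator from respondent and nonrespondent means only."""
    n = len(x)
    resp = [k for k, yk in enumerate(y) if yk is not None]
    nonresp = [k for k, yk in enumerate(y) if yk is None]
    m = len(resp)
    y_bar_r = sum(y[k] for k in resp) / m
    if not nonresp:                  # full response: nothing to correct
        return y_bar_r
    x_bar_r = sum(x[k] for k in resp) / m
    x_bar_o = sum(x[k] for k in nonresp) / len(nonresp)
    return y_bar_r * (1.0 + (1.0 - m / n) * (c * x_bar_o / x_bar_r - 1.0))


# Hypothetical toy sample; C = 1 reproduces ordinary ratio imputation.
x = [10.0, 20.0, 30.0, 40.0]
y = [12.0, None, 33.0, None]
uncorrected = corrected_mean_direct(x, y, 1.0)
direct = corrected_mean_direct(x, y, 1.2)
closed = corrected_mean_closed_form(x, y, 1.2)
```

Because the closed form needs only $\bar{y}_r$, $\bar{x}_r$, $\bar{x}_o$, $m$ and $n$, a correction can be applied at the estimation stage without touching values that were already imputed with $C = 1$.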

Here, we shall examine choices of $C$ that are adaptive,
that is, they reflect the realized sample $s$ and the realized
response set $r$. Ideally, $C$ should be such that the imputation
will exactly restore the estimator $\bar{y}_s = (1/n) \sum_s y_k$
that would be used with 100% response. This $C$-value is
determined by the equation

$$\bar{y}_s = \frac{1}{n} \sum_s y_k = \frac{1}{n} \left( \sum_r y_k + \sum_o C \hat{B}_r x_k \right).$$

A simple calculation shows that the optimal $C$-value is

$$C_{opt} = \hat{B}_o / \hat{B}_r,$$

where $\hat{B}_o = \sum_o y_k / \sum_o x_k$ is the slope estimate if the model
(2.2) could be fitted to nonrespondents. The imputed
values would then be $\hat{y}_k = \hat{B}_o x_k$ for $k \in o$. Obviously,
$C_{opt}$ and $\hat{B}_o$ cannot be computed since they depend on
missing $y_k$-values. For an unconfounded mechanism (as in
Assumption II), we can expect $C_{opt} \approx 1$, given $s$, because

$$E_\xi E_q(C_{opt} \mid s) \approx E_q\left\{ \frac{E_\xi(\hat{B}_o)}{E_\xi(\hat{B}_r)} \,\Big|\, s \right\} = 1.$$

But for a confounded mechanism (as in Assumption III),
$C_{opt}$ can be distinctly away from unity. Suppose that
$C_{opt} > 1$. Note that $C_{opt} > 1$ if and only if $\sum_r e_{ks} < 0$
with $e_{ks} = y_k - \hat{B}_s x_k$, where $\hat{B}_s = (\sum_s y_k)/(\sum_s x_k)$ is the
unknown slope estimate with 100% response. That is,
$C_{opt} > 1$ implies that respondents' residuals $e_{ks}$ are
negative on the average. An illustration of this is shown
in Figure 1, where $n = 10$, $l = n - m = 5$, and all five
respondents' residuals $e_{ks}$ are negative.
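
In a simulation, where the $y$-values of nonrespondents are known to the experimenter, the optimal factor can be computed and its restoring property checked. The sketch below is illustrative only; the toy data and variable names are hypothetical, and the optimal factor is taken to be the ratio of the nonrespondent slope to the respondent slope, as obtained by solving the restoring equation for $C$.

```python
# Hypothetical full-response data for a sample of n = 4 units.
x = [10.0, 20.0, 30.0, 40.0]
y_full = [12.0, 26.0, 33.0, 50.0]
responded = [True, False, True, False]

n = len(x)
sum_r_y = sum(yk for yk, r in zip(y_full, responded) if r)
sum_r_x = sum(xk for xk, r in zip(x, responded) if r)
sum_o_y = sum(yk for yk, r in zip(y_full, responded) if not r)
sum_o_x = sum(xk for xk, r in zip(x, responded) if not r)

b_r = sum_r_y / sum_r_x            # slope from respondents
b_o = sum_o_y / sum_o_x            # slope from nonrespondents (unknown in practice)
c_opt = b_o / b_r                  # optimal correction factor

# Imputing C_opt * B_r * x_k restores the full-response sample mean.
restored = (sum_r_y + c_opt * b_r * sum_o_x) / n
y_bar_s = sum(y_full) / n

# Respondent residuals about the full-response slope B_s.
b_s = sum(y_full) / sum(x)
resp_residual_total = sum(
    yk - b_s * xk for xk, yk, r in zip(x, y_full, responded) if r
)
```

In this toy sample `c_opt` exceeds 1 and the respondent residual total is negative, in line with the equivalence stated above.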


Figure 1. Example of data plot $(y_k, x_k)$ for a confounded
response mechanism.


Assuming that $C_{opt} > 1$, one approach for the analyst
working under Assumption III is to choose a computable
$C$ likely to satisfy $C > 1$ and then use this $C$ to construct
the estimator (2.7). Factors $C$ that will sometimes work
in this manner are

$$c_1 = \frac{\bar{x}_o}{\bar{x}_r}, \quad c_2 = \frac{\bar{x}_o}{\bar{x}_s}, \quad c_3 = \frac{\bar{w}_o}{\bar{w}_r}, \quad c_4 = \frac{\bar{w}_o}{\bar{w}_s}. \qquad (2.8)$$

They are based on the logic that if the response mechanism
is confounded in such a way that the nonresponse probability
is a function of $y$ (for example, $\theta_k = 1 - e^{-\gamma y_k}$
with $\gamma > 0$), then both $C_{opt} > 1$ and $\bar{x}_o > \bar{x}_r$ are likely
to occur, as Figure 1 illustrates. Conversely, if nonresponse
is a decreasing function of $y_k$, then both $C_{opt} < 1$ and
$\bar{x}_o < \bar{x}_r$ are likely to occur.

One important feature of such correction factors is that
they can, but need not, be calculated during the imputation
phase. For instance, if the usual ratio imputation $\hat{B}_r x_k$
was carried out at the imputation phase, it is then possible
to calculate a suitable correction factor at the estimation
phase without changing the originally imputed values.

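Note that $c_2$ implies a somewhat milder correction than
$c_1$: if $c_1 \geq 1$, we have $1 \leq c_2 \leq c_1$. The choices $C = c_3$
and $C = c_4$ are calculated on the ranks of the $x$-values,
rather than on the $x$-values themselves, to dampen the
effect of extreme $x$-values. More specifically, letting $w_k$
be the rank of $x_k$ in the data set $\{x_k : k \in s\}$, the $w$-means
in $c_3$ and $c_4$ are $\bar{w}_s = (1/n) \sum_s w_k$, $\bar{w}_r = (1/m) \sum_r w_k$ and
$\bar{w}_o = (1/l) \sum_o w_k$. The four estimators obtained by
letting $C = c_i$ in (2.7) according to (2.8) will be denoted
as $\bar{y}_{c_i \bullet s}$, $i = 1, \ldots, 4$. In particular, we have

$$\bar{y}_{c_1 \bullet s} = \bar{y}_r \left[ 1 + \left( 1 - \frac{m}{n} \right) \left( \frac{\bar{x}_o^2}{\bar{x}_r^2} - 1 \right) \right] \qquad (2.9)$$

and

$$\bar{y}_{c_2 \bullet s} = \bar{y}_r \left[ 1 + \left( 1 - \frac{m}{n} \right) \left( \frac{\bar{x}_o^2}{\bar{x}_r \bar{x}_s} - 1 \right) \right]. \qquad (2.10)$$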
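
A sketch of how sample-dependent factors of this kind might be computed is given below. It rests on stated assumptions rather than on the paper's own code: the factors are taken to be ratios of the nonrespondent mean to the respondent mean ($c_1$, $c_3$) or to the full-sample mean ($c_2$, $c_4$), computed on the $x$-values ($c_1$, $c_2$) or on their ranks ($c_3$, $c_4$); ties in $x$ are assumed absent; the function name and data are hypothetical.

```python
def correction_factors(x, responded):
    """Return the four assumed factors c1..c4 for one sample and response set.

    x: auxiliary values for all units in the sample (no ties assumed).
    responded: booleans, True for respondents; both groups must be nonempty.
    """
    n = len(x)
    # Rank 1 for the smallest x-value, rank n for the largest.
    order = sorted(range(n), key=lambda k: x[k])
    w = [0.0] * n
    for rank, k in enumerate(order, start=1):
        w[k] = float(rank)

    def group_mean(values, want):
        picked = [v for v, r in zip(values, responded) if r == want]
        return sum(picked) / len(picked)

    x_r, x_o, x_s = group_mean(x, True), group_mean(x, False), sum(x) / n
    w_r, w_o, w_s = group_mean(w, True), group_mean(w, False), sum(w) / n
    return {"c1": x_o / x_r, "c2": x_o / x_s, "c3": w_o / w_r, "c4": w_o / w_s}


# Hypothetical sample in which the nonrespondents have the larger x-values,
# one of them extreme.
x = [10.0, 40.0, 20.0, 80.0]
responded = [True, False, True, False]
factors = correction_factors(x, responded)
```

With these toy values the rank-based factors are closer to 1 than their $x$-based counterparts, which illustrates the dampening of extreme $x$-values.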


The correction factors given in (2.8) are not ideal when
the correlation between $x$ and $y$ is close to 1. In this case,
we have $\hat{B}_r \approx \hat{B}_o \approx \hat{B}_s$, provided that the model (2.2)
holds. Therefore, the correction factor $C$ should be close
to 1. However, the correction factors given in (2.8) could
be very different from 1 and using them would introduce bias.
For this reason, it may be preferable to work with a
correction factor $C$ in (2.7) that takes the correlation into
account. Correction factors of this kind are


eG eel) Rye al) (2.11) 
where $c_i$, $i = 1, \ldots, 4$, are the four correction factors
given in (2.8), and $R_{xyr}$ is the estimated correlation coefficient
based on the respondent data. In our Monte Carlo
simulation we also included the estimator (2.7) corresponding
to the four choices $C = k_i$, $i = 1, \ldots, 4$.
These estimators will be denoted as $\bar{y}_{k_i \bullet s}$, $i = 1, \ldots, 4$.


3. VARIANCE ESTIMATION 


Since we are interested in variance estimators based on
single value imputation, the variance estimation method
proposed in Särndal (1990, 1992) is of interest. Assuming
unconfounded nonresponse and that the model $\xi$ in (2.2)
holds, the variance estimator for the point estimator
$\bar{y}_{raimp}$ in (2.4) obtained by this method is given by


Ik 1 1 y (Y-% oF ya) 
V p i = -_—_— — a a a eee ts 
(Yraimp) (« x) sng 
Trea jg aR ere 
Sel = = SSA Sel = = Alias 
n N m n 
ca hone aE Voir + Vie (3.1) 
where 


and 


a, Ch (i 1) 
¢ = +_—________, (3.2) 
aah bates (CV,,)*/m} 


| LD (te — %)2/(m = 1) 


xX 


where 


he Sh Me = BX; Vip = 


The variance of $\bar{y}_{raimp}$ has two components, namely,
the sampling variance and the variance due to imputation.
The first term in (3.1) (denoted by $\hat{V}_{ord}$) is an estimate of
the sampling variance calculated using the ordinary
variance formula assuming that imputed data are as good
as real observations. Since this assumption does not hold,
$\hat{V}_{ord}$ underestimates the true sampling variance. To correct
this underestimation, the second term $\hat{V}_{dif}$ in (3.1) is added.
The last term $\hat{V}_{imp}$ in (3.1) is an estimate of the variance
due to imputation.

If we compute the mean of the $y$-values from the completed
data set $\{y^{C}_{\bullet k} : k \in s\}$ given in (2.6), we get the
estimator (2.7). Its variance estimator should take the
correction factor $C$ into account. If we can assume that
the expectation $E_\xi E_p E_q$ is equal to $E_p E_q E_\xi$ (this is true
under unconfounded nonresponse), we can use Särndal's
(1990, 1992) method to obtain a variance estimator which
takes $C$ into account. However, we are mainly interested
in confounded cases. We therefore propose a variance
estimator based on the following heuristic argument.

The estimator $\hat{\sigma}^2$ in (3.2) uses the respondent data only.
It will certainly be biased for confounded mechanisms and
some correction is needed in order to use formula (3.1) for
the corrected estimator (2.7). We suggest replacing $\hat{\sigma}^2$ in
(3.1) by $C^2 \hat{\sigma}^2$, to obtain the following variance estimator
for the estimator $\bar{y}_{C \bullet s}$ in (2.7):

$$\hat{V}(\bar{y}_{C \bullet s}) = \hat{V}^{C}_{ord} + C^2 (\hat{V}_{dif} + \hat{V}_{imp}), \qquad (3.3)$$

where $\hat{V}^{C}_{ord}$ is computed using the data after imputation
with the bias correction factor $C$. Replacing $C^2$ by $c_i^2$ or
$k_i^2$, we obtain the variance estimators corresponding to
$\bar{y}_{c_i \bullet s}$ or $\bar{y}_{k_i \bullet s}$. The resulting variance estimators work quite
well in many of the cases covered in the simulation reported
in Section 5.


4. SIMULATION STUDY 


We are considering eight corrected estimators corresponding
to the eight correction factors given in (2.8) and
(2.11). A simulation study was conducted to determine
whether the corrected estimators succeed in restoring $\bar{y}_s$
under different response mechanisms, in particular,
confounded mechanisms. For comparison, we also included
the uncorrected estimators $\bar{y}_r$ and $\bar{y}_{raimp} = \bar{x}_s \bar{y}_r / \bar{x}_r$ given
by (2.4). Our primary objective was to examine the corrected
estimators when the finite population follows the
ratio model $\xi$ given by (2.2). However, we also wanted to
see how the corrected estimators behave under relationships
other than linear regression through the origin.

We also studied the coverage rates associated with the
different estimators when the confidence intervals are
computed with the aid of the variance estimators proposed
in Section 3.

For the simulation, we generated 12 different finite
populations, each of size $N = 100$, by specifying in different
ways the constants $a$, $b$, $c$, and $d$ in the regression
model:

$$y_k = a + b x_k + c x_k^2 + \epsilon_k, \quad E_\xi(\epsilon_k) = 0, \quad V_\xi(\epsilon_k) = d^2 x_k, \qquad (4.1)$$

where the $\epsilon_k$ are assumed to be independent. Four different
regression types were created by four different specifications
of $(a, b, c)$. These types are called RATIO
($a = c = 0$, $b > 0$, thus conforming to the ratio model
$\xi$ in (2.2)), CONCAVE ($a = 0$, $b > 0$, $c < 0$), CONVEX
($a = 0$, $b > 0$, $c > 0$) and NONRATIO ($a \neq 0$, $b > 0$,
$c = 0$). For each regression type, three different levels of
the model correlation $\rho_{xy}$, 0.7, 0.8 and 0.9, were obtained
by a suitable choice of $d$. This resulted in 12 specifications
of $(a, b, c, d)$ as shown in Table 1.


Table 1 
Characteristics of the Populations 


POP TYPE a b c d $R_{xy}$ $\bar{y}_U$

1 RATIO ‘head be 06.12" "0169 F095 
2 RATIO ar lays 0 4.50 0.81 69.92 
3 RATIO Oia les 0 2.91 0,90... 72.67 
4 CONCAVE 0 3 10,01. + 6.78 Os7lay 112.27 
yp eCONCAVET. L0e 4 -0.01 4.83 0.81 114.57 
6 CONCAVE 0 3 ~0.01 2.80 0.90 112.11 
7.\) (CONVEX tb! 1025>10.0.0R215.98 -or7it) 2195489 
g 8 “CONVEX 6/050.25 0.01. 498 “O:81-"  a7k06 
9. SCONVEX, “il 90.25: 710101» 12:35 S00 e482 
10 NON-RATIO 20 1.5 0. 612 O71 « 9535 
11 NON-RATIO 20 1.5 0 4.50 0.81 94.46 


12 NON-RATIO 20° 1.5 Dy 2S OS 93e02 


For each of the 12 specifications, we generated 100
population values $(y_k, x_k)$, $k = 1, \ldots, 100$, by a two
step process. We used the $\Gamma$-distribution with parameters
$\alpha$ and $\beta$. Its density is

$$f(x) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha - 1} \exp(-x/\beta), \quad x > 0. \qquad (4.2)$$

First, we generated 100 values $x_k$, $k = 1, \ldots, 100$,
according to the $\Gamma$-distribution with parameters $\alpha = 3$,
$\beta = 16$, implying that the mean is $\alpha\beta = 48$ and the
variance $\alpha\beta^2 = 768$. Then, for each fixed $x_k$, $k = 1,
\ldots, 100$, we generated one value $y_k$ according to the
$\Gamma$-distribution with parameters

$$\alpha = \frac{(a + bx + cx^2)^2}{d^2 x}, \qquad (4.3)$$

$$\beta = \frac{d^2 x}{a + bx + cx^2}, \qquad (4.4)$$

where $x = x_k$ and $(a, b, c, d)$ is one of the 12 vectors
fixed in advance. This implies that $E_\xi(y_k \mid x_k) = \alpha\beta =
a + b x_k + c x_k^2$ and $V_\xi(y_k \mid x_k) = \alpha\beta^2 = d^2 x_k$, as
required under the model (4.1). The same $x$-values were
used for all 12 populations. For the populations generated
by this process, Table 1 shows the values of the population
correlation $R_{xy}$ and the population mean of $y$. Note that
the values of $a$, $b$, $c$, and $d$ were chosen so as to obtain
realistic types of populations that can be encountered in
practice.
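
The two step generation process can be sketched with the Python standard library, whose `random.gammavariate(alpha, beta)` uses the same shape and scale parametrization as the density above (mean $\alpha\beta$). This is an illustrative sketch, not the authors' program; the function name, the seed and the error handling are assumptions.

```python
import random


def generate_population(a, b, c, d, n_units=100, seed=1):
    """Generate (x_k, y_k) pairs with E(y|x) = a + b*x + c*x^2 and V(y|x) = d^2 * x."""
    rng = random.Random(seed)
    # Step 1: x-values from a Gamma distribution with shape 3 and scale 16.
    xs = [rng.gammavariate(3.0, 16.0) for _ in range(n_units)]
    population = []
    for x in xs:
        mean = a + b * x + c * x * x
        variance = d * d * x
        if mean <= 0.0:
            raise ValueError("conditional mean must be positive for a Gamma draw")
        # Step 2: choose shape and scale so that shape*scale = mean and
        # shape*scale^2 = variance.
        shape = mean * mean / variance
        scale = variance / mean
        population.append((x, rng.gammavariate(shape, scale)))
    return population


# Two hypothetical specifications of (a, b, c, d): one of RATIO type and one
# with a nonzero intercept.
pop_ratio = generate_population(0.0, 1.5, 0.0, 4.5)
pop_nonratio = generate_population(20.0, 1.5, 0.0, 6.12)
```

Because all $x$-values are drawn before any $y$-value, reusing the same seed gives identical $x$-values across specifications, mirroring the use of a single set of $x$-values for all populations.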

To simulate nonresponse, we used five different
nonresponse mechanisms, each defined by independent
Bernoulli$(\theta_k)$ trials, where the probability of nonresponse
$\theta_k$ for unit $k$ was specified as follows:

(M1) $\theta_k$ is constant and independent for all $k \in U$. This
is the uniform response mechanism, therefore
unconfounded.


(M2) $\theta_k$ is a decreasing function of $x_k$ specified as $\theta_k =
\exp(-\gamma x_k)$. This is an unconfounded mechanism.


(M3) $\theta_k$ is an increasing function of $x_k$ specified as $\theta_k =
1 - \exp(-\gamma x_k)$. This is also an unconfounded
mechanism.


(M4) $\theta_k$ is a decreasing function of $y_k$ specified as $\theta_k =
\exp(-\gamma y_k)$. This is a confounded mechanism.


(M5) $\theta_k$ is an increasing function of $y_k$ specified as
$\theta_k = 1 - \exp(-\gamma y_k)$. This is also a confounded
mechanism.


Note that since we assume $x$ and $y$ to be positively
correlated, both (M2) and (M4) are mechanisms such that
large units respond more often than small units. The
smaller units will be underrepresented in the response set $r$.
Conversely, (M3) and (M5) are mechanisms such that
small units respond more often than large units. The larger
units will be underrepresented in the response set $r$.
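
The five mechanisms can be expressed as one function of $(x_k, y_k)$ followed by independent Bernoulli draws. The sketch below is an illustration under assumptions: the function names, the labels used to select a mechanism, the toy sample and the parameter values are hypothetical.

```python
import math
import random


def nonresponse_prob(mechanism, x, y, gamma, theta=0.3):
    """Nonresponse probability of one unit under mechanisms M1 to M5."""
    if mechanism == "M1":
        return theta                        # uniform, same for every unit
    if mechanism == "M2":
        return math.exp(-gamma * x)         # decreasing in x, unconfounded
    if mechanism == "M3":
        return 1.0 - math.exp(-gamma * x)   # increasing in x, unconfounded
    if mechanism == "M4":
        return math.exp(-gamma * y)         # decreasing in y, confounded
    if mechanism == "M5":
        return 1.0 - math.exp(-gamma * y)   # increasing in y, confounded
    raise ValueError("unknown mechanism: " + mechanism)


def draw_response_set(sample, mechanism, gamma, rng, theta=0.3):
    """Return one boolean per unit: True means the unit responds."""
    return [
        rng.random() >= nonresponse_prob(mechanism, x, y, gamma, theta)
        for x, y in sample
    ]


# Hypothetical sample of (x_k, y_k) pairs.
sample = [(10.0, 14.0), (25.0, 40.0), (60.0, 95.0)]
flags = draw_response_set(sample, "M5", 0.01, random.Random(42))
```

A unit is a nonrespondent when the uniform draw falls below its nonresponse probability, so each unit responds with probability one minus that value.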

The first mechanism corresponds to the naive Assumption
I discussed in Section 2. (M2) and (M3) correspond
to Assumption II while (M4) and (M5) represent fairly
simple examples of the confounded mechanisms discussed
in connection with Assumption III. For (M2), (M3), (M4)
and (M5), the constant $\gamma$ was determined in such a way
that the average nonresponse probability $\bar{\theta} = (1/N)
\sum_U \theta_k$ is equal to one of the values 10%, 20%, 30%
and 40%. Therefore, for each population, there were
5 x 4 = 20 different combinations of nonresponse
mechanism and nonresponse rate.
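
Determining the constant so that the average nonresponse probability hits a target can be done by simple root finding, since the average is monotone in the constant. The following bisection sketch is one possible way to do this, not necessarily the method used in the study; the function names, tolerance and example values are assumptions.

```python
import math


def mean_nonresponse(gamma, values, increasing):
    """Average nonresponse probability over the population values (x or y)."""
    if increasing:
        probs = [1.0 - math.exp(-gamma * v) for v in values]
    else:
        probs = [math.exp(-gamma * v) for v in values]
    return sum(probs) / len(probs)


def calibrate_gamma(values, target, increasing, tol=1e-10):
    """Bisection for gamma > 0 such that the average probability equals target.

    Assumes all values are positive and 0 < target < 1.
    """
    sign = 1.0 if increasing else -1.0   # direction in which the average moves

    def gap(g):
        return sign * (mean_nonresponse(g, values, increasing) - target)

    lo, hi = 0.0, 1.0
    while gap(hi) < 0.0:                 # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical population values and a 30% target nonresponse rate.
values = [10.0, 20.0, 30.0, 40.0, 50.0]
gamma_inc = calibrate_gamma(values, 0.30, increasing=True)
gamma_dec = calibrate_gamma(values, 0.30, increasing=False)
```

Passing the $x$-values gives the constant for the $x$-dependent mechanisms, and passing the $y$-values gives it for the $y$-dependent ones.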

For each of the 12 populations, 1,000 samples of size
$n = 30$ were drawn. Then for each realized sample,
50 response sets were generated using independent
Bernoulli$(\theta_k)$ trials according to one of the 20 combinations
of nonresponse mechanism and nonresponse rate.
Thus 50,000 response sets were realized for each of the
12 x 20 = 240 combinations resulting from cross-classifying
the 12 populations with the 20 combinations
of nonresponse mechanism and nonresponse rate.
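
The nested design (samples, then response sets within each sample) can be sketched as a double loop. This is a simplified illustration under assumptions: only the uniform mechanism is shown, the respondent mean stands in for any estimator, replicate counts are reduced, and all names and data are hypothetical.

```python
import random


def respondent_mean(sample, responded):
    """Mean of y over the responding units of one sample."""
    ys = [y for (x, y), r in zip(sample, responded) if r]
    return sum(ys) / len(ys)


def monte_carlo(population, n, n_samples, n_response_sets, theta, estimator, seed=0):
    """Draw SRSWOR samples, then several response sets per sample.

    theta is a common nonresponse probability (uniform mechanism only).
    Response sets with no respondents are skipped.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        sample = rng.sample(population, n)          # SRSWOR of size n
        for _ in range(n_response_sets):
            responded = [rng.random() >= theta for _ in sample]
            if any(responded):
                estimates.append(estimator(sample, responded))
    return estimates


# Hypothetical population of (x_k, y_k) pairs and a small run with full response.
population = [(float(k), 2.0 * k) for k in range(1, 101)]
estimates = monte_carlo(
    population, n=30, n_samples=20, n_response_sets=5,
    theta=0.0, estimator=respondent_mean,
)
```

With 1,000 samples and 50 response sets per sample the same loop would yield the 50,000 replicates per combination described above.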


5. RESULTS 


We studied the two uncorrected estimators $\bar{y}_r$ (justified
under Assumption I) and $\bar{y}_{raimp} = \bar{x}_s \bar{y}_r / \bar{x}_r$ (justified under
Assumption II) and the 8 corrected estimators $\bar{y}_{c_i \bullet s}$ and
$\bar{y}_{k_i \bullet s}$, $i = 1, \ldots, 4$ (justified under Assumption III). (We
call both $\bar{y}_r$ and $\bar{y}_{raimp}$ uncorrected even though (2.5) shows
that we can view $\bar{y}_{raimp}$ as a corrected version of the naive
estimator $\bar{y}_r$. Recall that our principal aim is to correct
the bias of $\bar{y}_{raimp}$ when the mechanism is confounded.)

The performance of the 10 estimators is judged by the
magnitudes of the relative bias (RB), the relative root mean
square error (RRMSE), and the coverage rate (CVR). The
RB and the RRMSE of a point estimator $\hat{\bar{y}}_U$ for $\bar{y}_U$ are
defined respectively as

$$RB(\hat{\bar{y}}_U) = 100 \times \frac{E_p E_q(\hat{\bar{y}}_U) - \bar{y}_U}{\bar{y}_U},$$

$$RRMSE(\hat{\bar{y}}_U) = 100 \times \frac{\{E_p E_q(\hat{\bar{y}}_U - \bar{y}_U)^2\}^{1/2}}{\bar{y}_U}.$$

The expectations $E_p E_q(\hat{\bar{y}}_U)$ and $E_p E_q(\hat{\bar{y}}_U - \bar{y}_U)^2$ were
estimated by Monte Carlo simulation using the 50,000
realized response sets for each of the 240 combinations. With
this number of replicates, the Monte Carlo error was less
than 0.1%, assuming that the distribution of the $\hat{\bar{y}}_U$'s is
approximately normal. We will use the abbreviation ARB
to denote the absolute relative bias, $|RB(\hat{\bar{y}}_U)|$.
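
Given the replicated estimates from a simulation, the two performance measures reduce to simple averages. The sketch below is illustrative; the function names and the toy replicate values are hypothetical.

```python
import math


def relative_bias(estimates, true_mean):
    """RB in percent: Monte Carlo mean of the estimates relative to the true mean."""
    mc_mean = sum(estimates) / len(estimates)
    return 100.0 * (mc_mean - true_mean) / true_mean


def relative_rmse(estimates, true_mean):
    """RRMSE in percent: root of the Monte Carlo mean squared error, relative to the true mean."""
    mse = sum((e - true_mean) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / true_mean


# Hypothetical replicate estimates of a population mean equal to 10.
estimates = [9.0, 11.0, 10.0, 12.0]
rb = relative_bias(estimates, 10.0)
arb = abs(rb)
rrmse = relative_rmse(estimates, 10.0)
```

The absolute relative bias is simply the absolute value of the relative bias.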

We will also discuss the coverage rate (CVR) of the 95%
confidence interval constructed as

$$\hat{\bar{y}}_U \pm 1.96 \{\hat{V}(\hat{\bar{y}}_U)\}^{1/2}, \qquad (5.1)$$

where $\hat{\bar{y}}_U$ is one of the 10 estimators and $\hat{V}(\hat{\bar{y}}_U)$ the corresponding
variance estimator. For $\bar{y}_{raimp}$ and the 8 corrected
estimators, we used the variance estimators described in
Section 3. For $\bar{y}_r$, we used the variance estimator

$$\hat{V}(\bar{y}_r) = \left( \frac{1}{m} - \frac{1}{N} \right) \sum_r (y_k - \bar{y}_r)^2 / (m - 1).$$

The CVR is calculated as 100 times the proportion of the
50,000 response sets such that the interval computed in the
manner of (5.1) includes the true mean $\bar{y}_U$.
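
The coverage rate is the percentage of replicates whose interval contains the true mean. A minimal sketch follows; the function name and the toy inputs are hypothetical.

```python
import math


def coverage_rate(estimates, variance_estimates, true_mean, z=1.96):
    """Percentage of intervals estimate +/- z*sqrt(variance) that contain true_mean."""
    hits = 0
    for est, var in zip(estimates, variance_estimates):
        half_width = z * math.sqrt(var)
        if est - half_width <= true_mean <= est + half_width:
            hits += 1
    return 100.0 * hits / len(estimates)


# Hypothetical replicates: only the first interval covers the true mean of 10.
estimates = [10.0, 12.0, 8.0]
variance_estimates = [1.0, 0.25, 1.0]
cvr = coverage_rate(estimates, variance_estimates, 10.0)
```

A coverage rate well below 95% can result either from a biased point estimator or from a variance estimator that is too small, so it is best read together with the ARB.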

For the following discussion, we group the corrected
estimators into two groups: s-corrected estimators, which
are based on correction factors involving $\bar{x}_s$ or $\bar{w}_s$, that is,
$c_2$, $c_4$, $k_2$ and $k_4$, and r-corrected estimators, which are
based on correction factors involving $\bar{x}_r$ or $\bar{w}_r$, that is, $c_1$,
$c_3$, $k_1$ and $k_3$.

The nonresponse mechanism is the key to the perfor- 
mance of the various estimators. Therefore, Tables 2 and 3 
show the behavior of the estimators separately for each 
of the five mechanisms. We noted that the correlation level 
and the nonresponse rate do not have a very pronounced 
effect on the ranking of the estimators. Thus the perfor- 
mance measures ARB, RRMSE and CVR were averaged 
over 12 cases (three correlation levels x four nonresponse 
rates). These averages are shown in Table 2 for the RATIO 
type regression and in Table 3 for the CONCAVE, 
CONVEX and NONRATIO regression types.

Table 2
Average ARB, RRMSE (RM) and CVR of Ten Different Estimators for the RATIO Type Populations


(For each mechanism, 12 cases were averaged: four nonresponse rates x three correlation levels)


M1 M2 M3 M4 M5
(uniform) (decreasing-x) (increasing-x) (decreasing-y) (increasing-y)
Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av.


ARB RM CVR ARB RM CVR ARB RM CVR ARB RM CVR ARB RM CVR


Ve OL? 13:95 9235 12:9 AG =86:0 oS 16s Sie IO PEK 7743 AEO SOO GS: 
Yraimp 0.2 1D Soe OG ey SBI) OFF OS eo Es Sic AO, SS) OW) ABS S545) 
Jers 1.0 133) Spal bl Gy SSS) 8.9) 1823) 7 9320 ges TN Sp. zal BAG) eB) CD? 
Baa 0.9 3 ee O23 AH NK SKE eee NG BBO) 1 PALE a 9288 SEA AO Beno DeD 
Irs 1.1 32D 9288 DEA IOOR 9059 8.0) Sse e93t5 Le ML a 2 9883 Deh Mlsyesh OPO) 
Vicde 1.0 Sisk Ove 7 DS MPO SIONS) ks G6 SBS eG, Aa Be itech 4g Oe) 
Vales ing WALT Dilevl 529 loa oO (Ie AG BiHHS eo) «ale = RIDES) eo) PALS) DG 
Vo3-5 1.6 14.4 91.4 GD NBS) me kOe I WAS SSI SS DMN WORD” “SORT 83) 20r4 9020 
Writs 2.0 WA SPB Sill OX. SOG: IS) PASS RSIS Hol WP Gp eses seat Bas SD 
Wide 7, 14737923 Sl DR4 SOLS WAN ye BIG) se} Om ae OD, Wat” PAR ORG 
Table 3 


Average ARB, RRMSE (RM) and CVR of Six Different Estimators for CONCAVE, CONVEX, 
and NONRATIO Populations 


(For each mechanism, 12 cases are averaged as in Table 2) 


M1 M2 M3 M4 M5

Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av.

ARB RM CVR ARB RM CVR ARB RM CVR ARB RM CVR ARB RM CVR
CONCAVE 

J, 0.2 10.4 92.9 OES Pela Ome Oo. Thee ADE 9 3) 12 Sell. Omee Ses Silly ERS 4 aeOES 

Draimp 0.2 9.4 94.5 1.4 Sole 9354 220) SO 9429 1.9 OD 94:9 21 Og SPS) 

roe itil 11.4 92.4 (yey Te RY Wilates eee 3A NOW LO SRS 0 ARDS 

Wehlos 1.0 [i Slee 2e8 (Gy I Sy 3} FAT SIO S828 3.6" 103" "8958 Seo) eS) eo, 

Drs 1.0 LO fenee9 5 )0 ie = IKON eat Oey aks SOULS 17 OS 93n0) 3a “2S 93a 

Dyedrors 0.9 10.5 93.8 ANG NOR eH) DO GOR OES 1.8 939258 Sym ass Be) 
CONVEX 


ON) P37 — SO: OY SK 23 ISAO FOES Sail 33.20 4 ley) ee 6.4 37.1 41.4 37.5 


< 


Fade oe. 6 eG 2d 9036 ce TE OP 7.0.20 856 14.0 25.0 90.0 21.6) 335 500 
yo 12" “ae Rotts 0.4 19.8 91.8 2.0 222) 292% 7.3 20.8 93.4 17 Sees ari 
rere 2 Dien Os 0.3 19.9 91.5 ts 22354 6.7 20.6 93.4 18:5 =28/5.. 70:5 
Ten 1.6 31.9 91-9 FO M2020 8.0) 2226 9. Sau22, 78 917 16:2: 2F.6. Wid 
Tee aE Dia O1 6 DOF MTN 201 38 D6 422.09 49233 9.5. $20 7Em oie 17 6nne aioe 
NON-RATIO 
5, Ox e S107 102.9 9.7 14.6 86.5 jit ON She 11,9: S16. 0 8029 828) i Sosa 
SAIN LP Ro OLE 2.1" 99.54 04 2.6 SOSe 9538 Di apa S65 O44 Wes Sere 
he ives hoe $0205 7a OM a8 515 11:9" “1888.2 2.6 10.0 90.9 503+ 42 Ste 0s 
Dak Loar o24 (pe OAL ek i 5 eal 8o4 25) (lO 1 90. 6 ADs) ISLS WAIN 
Vens§ fav T8934 5:0) 10.0 eerenr9 11.3. 19.0 90.7 13 9:6. 192°8 49 TAS 935 


Sy Piles ital 10.9 93.4 Ota le OoOrS MOR Aa itl ites On IP SPAS 41 13.4\, 93.8

We now comment on the tables. A conclusion of general
character is that the respondent mean $\bar{y}_r$ has, as expected,
a large bias and a very poor CVR for all of the nonuniform
mechanisms. Its performance is satisfactory only for the
uniform mechanism (M1). Thus we can focus on the
comparisons between the uncorrected $\bar{y}_{raimp}$ on the one
hand and the eight corrected estimators on the other. For
both of the criteria ARB and RRMSE, we noted that the
s-corrected estimators generally gave better results than
the r-corrected ones. This is clearly seen in Table 2, where
s-corrected and r-corrected estimators are displayed in
two separate groups. Given this better behavior of the
s-corrected group, we deleted the r-corrected group in
Table 3.


5.1 RATIO Type Regression 
From Table 2, we draw the following conclusions. 


(i) The mechanism (M1) (uniform nonresponse). 


When the mechanism (M1) holds, the uncorrected
estimator $\bar{y}_{raimp}$ is essentially bias free, and there is no
need to correct. However, if the analyst, suspecting a
confounded mechanism, has nevertheless chosen one of
the corrected estimators, the penalty is not severe. The
eight corrected estimators show only a small increase in
ARB and in RRMSE compared to $\bar{y}_{raimp}$.


(ii) The mechanisms (M2) and (M3) (unconfounded, 
nonuniform and x-value dependent). 


For these mechanisms, the ARB is seen to be very small
for the uncorrected estimator $\bar{y}_{raimp}$, as theory would lead
us to expect. Our interest is instead focused on the
behavior of the eight corrected estimators, since it is
important to know if a penalty is associated with an incorrect
decision to use one of these estimators. Such a decision
would be brought about by an incorrect assumption that
the response mechanism is confounded (when in fact it is
unconfounded but nonuniform). Table 2 shows that there
is indeed some penalty in the form of both increased ARB
and increased RRMSE. The penalty is less severe for the
s-corrected group. For both groups, the penalty is less
severe for the mechanism (M2) than for the mechanism
(M3).


(iii) The mechanism (M4) (confounded and y-value 
dependent). 


For this mechanism, a striking feature of Table 2 is that
all eight corrected estimators give a substantial bias reduction
compared to the uncorrected estimator $\bar{y}_{raimp}$ (and a
very large reduction relative to the naive estimator $\bar{y}_r$).
The corrected estimators also show some improvement in
RRMSE compared to $\bar{y}_{raimp}$. The s-corrected estimators
perform better than the r-corrected ones. Within the
s-corrected group of estimators, the differences are minor,
as is the case within the r-corrected group.

(iv) The mechanism (M5) (confounded and y-value
dependent).


Table 2 shows that the s-corrected estimators have a
smaller ARB than the uncorrected $\bar{y}_{raimp}$; their RRMSE is
slightly higher. By contrast, the r-corrected estimators
"overcorrect" so that both the ARB and the RRMSE
exceed the levels observed for $\bar{y}_{raimp}$. The r-corrected
group does not perform well for this mechanism.

In summary, Table 2 shows that if the ratio model (2.2)
holds and the assumption of a confounded mechanism is
correctly made, the decision to use one of the corrected
estimators may lead to a reduced bias. The main difficulty
facing the analyst is to accurately predict the nature of the
response mechanism causing nonresponse. In particular,
it may be difficult for the analyst to separate a confounded
mechanism (e.g., one with $\theta_k = e^{-\gamma y_k}$) from a similar
nonuniform unconfounded mechanism (e.g., one with
$\theta_k = e^{-\gamma x_k}$). Yet this subtle difference has a marked
effect on the bias of $\bar{y}_{raimp}$ and on the decision whether or
not to use a corrected estimator. When the nonuniform
unconfounded type applies, we have seen that there is
a penalty associated with the corrected estimators, in
particular with the r-corrected group.


5.2 Other Regression Types 


Table 3 shows the performance of six estimators 
(the two uncorrected and the four s-corrected) for the 
CONCAVE, CONVEX, and NONRATIO regression 
types. As in Table 2, there is little to choose between the 
estimators when the uniform mechanism (M1) holds. For 
the two confounded mechanisms, the results in Table 3 do 
not send a clear message that s-corrected estimation should 
be attempted even if the assumption of a confounded 
mechanism is correctly made. Compared to the uncorrected 
Yraimp» the s-corrected estimators show a clearly improved 
performance (in terms of smaller ARB and smaller 
RRMSE) only for the CONVEX population type. Even 
in this case, a substantial bias remains after the attempt 
at correction. For the two unconfounded nonuniform 
mechanisms (M2) and (M3), it is a priori clear that one 
would not expect improved performance on the part of the 
s-corrected estimators when compared to $\bar{y}_{raimp}$. Oddly 
enough however, we find that the s-corrected estimators 
work very well for the CONVEX population. These 
conclusions leave the analyst with a difficult choice if a 
RATIO type population cannot be assumed. Then it is 
difficult on the basis of our findings to recommend the use 
of one of the corrected estimators. 


5.3 Coverage Rates 


Tables 2 and 3 also show that the variance estimation 
procedure suggested in Section 3 generally works well. 
Indeed the coverage rates for the corrected estimators are 
uniformly good whenever the ARB is small. In particular, 
Table 4 
Average ARB, RRMSE (RM) and CVR of the Two Uncorrected Estimators and the 
c4 - and kg - Corrected Estimators 
(Averaged Over All Population Types) 
M1 M2 M3 M4 M5 Overall 
Av. Av. Av. Av. Av. Avy. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Av. Avy. 
ARB RM CVR ARB RM CVR ARB RM CVR ARB RM CVR- ARB RM CVR- ARB RM CVR 
Jy, O45 13.0 20.0 86.8 Oe diielin Sy ? IO BW Gee. GE BPR CDE 11.9 19.6 80.4 
Vrain OCS" 32 693-1 DESOANSONE9229 Srl 409220 5.8 14.2" '93°0 Ose 16.7, 2 8.0 4.2 14.2 90.4 
Weloe OP AD me ODS 4.7 14.0 86.8 8.3 19.0 90.8 Sl IB Pe Dies all WEI EO SPIES © SRT 
Dr cc. ee Welle TAO) SZ) B ASML SkGaus9tS Ted S742 392.2, 346) 11626). SPE Geil ol ADs ASA. 2 an lS ee Sy) 


for the unconfounded mechanisms (M2) and (M3), the 
coverage rates for the corrected estimators are about equal 
to or better than those for the uncorrected estimators. 


5.4 Overall Comments 


From the summary Table 4, we note that, as expected, 
$\bar{y}_r$ and $\bar{y}_{raimp}$ show the best performance for the uniform 
response mechanism (M1). The uncorrected estimator 
$\bar{y}_{raimp}$ is the best one for the unconfounded mechanisms 
(M2) and (M3), while the corrected estimators are the best 
ones for the confounded mechanisms (M4) and (M5). 

Finally, on the average over all 240 cases included in 
our study, we note from the overall column of Table 4 that 
$\bar{y}_{raimp}$ and Yx4.; perform similarly with the former having 
a slightly smaller bias and the latter having slightly better 
coverage rate. 


6. CONCLUSIONS 


It has long been recognized that nonresponse causes 
bias in survey estimates, except in rare cases. Imputation 
is a widely used practice to handle nonresponse, because 
it is convenient to work with a complete data set. There 
are many imputation rules as well as some software that 
can be used in large scale surveys. Imputation is sometimes 
applied without critical questioning, and, although widely 
used, imputation does not solve the critical problem of bias 
caused by nonresponse. 

In this paper, we have examined ratio imputation. The 
ordinary ratio imputation $\hat{B}_r x_k$ is justified (that is, it 
produces no bias) if two conditions hold: (a) the regression 
model behind the ratio imputation rule holds (that is, a 
linear regression through the origin); (b) the response 
mechanism is unconfounded. 

The results of our simulation give some idea of the 
magnitude of the bias of the usual ratio imputation estimator 
$\bar{y}_{raimp}$ when one or both of the two conditions 
break down. We considered several nonuniform response 
mechanisms, confounded as well as unconfounded mech- 
anisms. We also considered breakdown of the regression 
model behind ratio imputation. 


We argued that a confounded mechanism can sometimes 
be realistically assumed in a survey. We showed that if an 
assumption of confounded response mechanism is correctly 
made, and if the model behind the ratio imputation is 
valid, one can make some progress toward bias reduction 
using the s-corrected estimators in this paper. They have 
substantially less bias than the uncorrected estimator 
$\bar{y}_{raimp}$. The s-corrected estimators are generally more effective 
than the r-corrected estimators for reducing the bias. 

Suppose the analyst is working under the assumption 
that the ratio model (2.2) holds. Our simulation study then 
leads to suggested estimators according to the following 
Table 5, depending on the assumed nature of the response 
mechanism and on the nonresponse rate. The entry "any" 
means any of the 10 estimators in Table 2. 


Table 5 
Suggested Estimators for Each Nonresponse Mechanism 

                          Suggested Estimator 

Nonresponse               Response Mechanism 
Rate          Uniform     Unconfounded             Confounded 
(< 10%)       any         any but $\bar{y}_r$      any but $\bar{y}_r$ 
(> 10%)       any^1       $\bar{y}_{raimp}$        s-corrected 

Note 1: $\bar{y}_{raimp}$ has a slight advantage over the others. 


If the regression model behind ratio imputation fails, 
the situation is less clear. Unless the naive assumption of 
a uniform response mechanism holds (which is unlikely), 
the uncorrected ratio imputation estimator $\bar{y}_{raimp}$ can 
have considerable bias. We found that $\bar{y}_{raimp}$ is particularly 
prone to bias for the CONVEX type population, 
where the s-corrected group of estimators usually has 
smaller bias than $\bar{y}_{raimp}$. On the other hand, for the 
CONCAVE and the NONRATIO type populations, $\bar{y}_{raimp}$ 
is generally more resistant to bias than the s-corrected 
estimators. 

Dual System Estimation of Census Undercount in the Presence 
of Matching Error 


YE DING and STEPHEN E. FIENBERG¹ 


ABSTRACT 


Dual system estimation (DSE) has been used since 1950 by the U.S. Bureau of the Census for coverage evaluation of 
the decennial census. In the DSE approach, data from a sample are combined with data from the census to estimate 
census undercount and overcount. DSE relies upon the assumption that individuals in both the census and the sample 
can be matched perfectly. The unavoidable mismatches and erroneous nonmatches reduce the accuracy of the DSE. 
This paper reconsiders the DSE approach by relaxing the perfect matching assumption and proposes models to 
describe two types of matching errors, false matches of nonmatching cases and false nonmatches of matching cases. 
Methods for estimating the population total and census undercount are presented and illustrated using data from the 1986 
Los Angeles test census and the 1990 Decennial Census. 


KEY WORDS: Capture-recapture; Matching bias; Modelling matching error; Multinomial likelihood. 


1. INTRODUCTION 


The problem of undercount in the U.S. census has been 
of special concern since the first census of 1790 (Jefferson 
1986). The DSE (or capture-recapture) approach has been 
used in conjunction with the census to evaluate population 
coverage as part of what is called the post-enumeration 
survey (PES) program. Ericksen and Kadane (1985) and 
Wolter (1986) describe the use of the DSE approach in the 
context of the 1980 decennial census. A new design for the 
PES was planned for the 1990 decennial census and 
refinements in methodology were examined in connection 
with a 1986 test census in central Los Angeles County, 
referred to as the Test of Adjustment Related Operations 
(TARO). Diffendal (1988) discusses methodology, opera- 
tions, and the results of TARO, and Hogan and Wolter 
(1988) and Schenker (1988) provide evaluation of the oper- 
ations and assumptions underlying the DSE approach. 

The PES approach to dual-system estimation uses two 
samples, called the P-sample and the E-sample. The P-sample, 
which is drawn separately from the census, helps to measure 
census omissions; the E-sample, drawn from the census 
enumerations, helps to measure census erroneous enumerations. 
For the 1986 TARO, the dual-system estimator for 
the population size, N, which combines the information 
from the P-sample and the E-sample takes the form: 


$$\hat{N} = (CEN - EE - SUB) \cdot N_p / M,$$


where CEN is the unadjusted census count; EE is the estimated 
number of erroneous enumerations and unmatchable 
persons included in the census; SUB is the number of 
whole-person substitutions in the census; $N_p$ is the number 
of people in the P-sample; and $M$ is the estimate of the number 
of people in both the census and the P-sample. For details see 
Diffendal (1988) or Wolter (1986). For the variation on 
this formula as used in conjunction with the 1990 census, 
see Hogan (1992, 1993). 
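
As a rough illustration of the estimator just described, the following Python sketch evaluates (CEN - EE - SUB) times the P-sample count divided by the estimated number of matches. The function name `dual_system_estimate` and its argument names are hypothetical; the inputs are the 1986 TARO figures quoted later in this paper (Table 2 and Section 5.1), used here only as an example.

```python
def dual_system_estimate(cen, ee, sub, n_p, m):
    """Dual-system estimate of the population size.

    cen : unadjusted census count
    ee  : estimated erroneous enumerations (and unmatchable persons)
    sub : whole-person substitutions
    n_p : number of people in the P-sample
    m   : estimated number of people in both the census and the P-sample
    """
    if m <= 0:
        raise ValueError("the estimated number of matches must be positive")
    return (cen - ee - sub) * n_p / m


# Illustrative inputs taken from the 1986 TARO figures quoted in this paper.
n_hat = dual_system_estimate(cen=355_352, ee=6_426, sub=5_259,
                             n_p=336_707, m=298_204)
print(round(n_hat))
```

With these inputs the result should agree, after rounding, with the total of 388,040 shown in Table 2.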

DSE and the matching problem gained considerable 
attention in the 1970’s due to its use in estimating births 
and deaths in developing countries, and it is thought by 
some that perhaps the greatest problem with the dual-system 
estimation approach used in the 1980 census was the 
rate of matching error (Fienberg 1989). Jaro (1989) 
describes the technological innovations for matching 
introduced by the Bureau of the Census for 1990 and the 
test of the related matching methodology in a 1985 pre- 
test. Biemer (1988) considers models for evaluating the 
impact of matching error on estimates of census coverage 
error without attempting to correct for the matching bias 
in the usual dual-system estimate. The actual procedure 
used in the 1990 census included not only a computer matching 
algorithm and various clerical follow-ups but also 
logistic regression models for unresolved cases in both the 
P-sample and E-sample (see Belin et al. 1993). 

Matching is used to determine the census enumeration 
status of the people enumerated in the P-sample. Specifi- 
cally, those people in the P-sample who are matched to the 
census are considered to have been enumerated. People 
in the P-sample who do not match are, for the most part, 
considered to have been missed by the census. Matching 
errors can occur for two general reasons: 


¹ Ye Ding is Research Scientist, Bureau of Biometrics, New York State Health Department, Concourse, Room C-144, Empire State Plaza, Albany, 
New York 12237, U.S.A.; Stephen E. Fienberg is Maurice Falk Professor of Statistics and Social Science, Department of Statistics, Carnegie Mellon 
University, Pittsburgh, Pennsylvania 15213, U.S.A. 

1. The information reported by the respondents/interviewers 
was incorrect. 

2. Correct information was reported, but it was not correctly 
used. 


Moreover, two types of errors can occur: false matches of 
nonmatching cases and false nonmatches of matching cases. 
False matches of nonmatching cases may be divided into 


(a) instances in which a P-sample case was erroneously 
matched to the enumeration of another person, but a 
match to that actual E-sample case should have been 
made, and 

(b) instances in which no match should have been made. 


The former case is not "serious" for the purposes of 
estimating N, since such false matches would have been, 
in fact, correctly classified as a match to the census. In the 
second case, however, the number of nonmatches becomes 
understated. False nonmatches to the census, on the other 
hand, have the effect of overestimating the nonmatch rate. 
Fay, Passel, Robinson and Cowan (1988) note that false 
nonmatches probably represent a greater concern than 
false matches. False matches are less common than false 
nonmatches because matches can be reviewed easily. 

In Section 2, we propose models for matching errors 
and then, in Sections 3 and 4, we present a systematic 
procedure for the estimation of the population total and 
thus the census undercount. In Section 5, we analyze the 
data from the 1986 Los Angeles test census and the 1990 Decennial 
Census to show how our method accounts for matching 
errors in the undercount estimates. 


2. MODELING MATCHING ERRORS 


For simplicity, we assume that the matching mechanism 
is constrained, in the sense that no individual in one sample 
can be matched with more than one individual in another 
sample. Moreover, we implicitly assume a version of 
simple random sampling, within strata, and this yields a 
standard multinomial sampling model for dual system 
estimation. This simplification allows us to focus on the 
impact of matching and its mechanisms. In what follows, 
we provide a way to view the recapture data, for the 
purpose of setting up models for matching. 

Let $Z_{[N \times 1]}$ be the characteristic vector for the whole 
population, such that the $i$-th component of $Z_{[N \times 1]}$ contains 
the characteristics for the $i$-th individual, where $1 \le i \le N$. 
Not all the components in $Z_{[N \times 1]}$ can be observed in any 
one sample. The object is to estimate $N$, the size of the 
population, from information from two samples. One 
could view drawing a sample from the population as 
drawing some components in $Z_{[N \times 1]}$ at random to form a 
new vector $Y$. Then, missing or misreporting of certain 
characteristics in those components drawn may cause 
matching errors. Henceforth we will refer to the first 
sample as $Y_1$ and the second sample as $Y_2$, and in the 
following discussion they will be the two capture-recapture 
samples for dual system estimation. 

Two types of matching errors can occur: false nonmatches 
of matching cases, and false matches of nonmatching 
cases. We will refer to the former as a type 1 error 
and the latter as a type 2 error. We can focus on modeling 
one or both types of error. Under perfect matching, each 
component in $Y_1$ or $Y_2$ contains the same information as 
in $Z_{[N \times 1]}$, and the number of matches will be the number 
of elements common to $Y_1$ and $Y_2$. When faced with uncertain 
matching, we consider the following simple model: 


Model (A): 

(i) Assume that those matched pairs of components under 
perfect matching will still be matched, each with 
common probability $\alpha$, $0 \le \alpha \le 1$. 

(ii) All those unmatched will remain unmatched, i.e., 
no false matches. 


Model (A) characterizes a mechanism for type 1 matching 
error with error probability $1 - \alpha$, assuming that type 2 
matching error is negligible. 

To develop a model for both types of matching error, 
we need to consider carefully all the possibilities that lead 
to false matches. When there is no matching error, one can 
write $Y_1 = (M_1, N_1)$ and $Y_2 = (M_2, N_2)$, so that sets 
$M_1$ and $M_2$ have the same size and every individual in $M_1$ 
is correctly matched with one individual in $M_2$ and vice 
versa; $N_1$ is the set of those in sample $Y_1$ who are not 
matched with anyone in sample $Y_2$, and $N_2$ is the set of 
those in sample $Y_2$ who are not matched with anyone 
in sample $Y_1$. When matching errors are present, false 
matches can occur in the following ways: 


(a) A person in $M_1$ is matched incorrectly with a person 
in $M_2$. 

(b) A false match occurs between $M_1$ and $N_2$. 

(c) A false match occurs between $M_2$ and $N_1$. 

(d) A false match occurs between $N_1$ and $N_2$. 


We note that each of (a), (b), (c) happens only when at 
least 2 errors are made, that is, the correct match is not 
made and an incorrect match is made. Since such errors 
occur with small probability, we assume for simplicity that 
cases (a), (b), (c) have negligible probability of occurrence 
in the next model. 


Model (B): 

(i) Assume, as in model (A), that matching pairs between 
$M_1$ and $M_2$ will still be matched, but with probability 
$\alpha$, $0 \le \alpha \le 1$. 

(ii) Assume that false matches of types (a), (b), (c) are 
negligible. 

(iii) Assume that each person in $N_1$ will be matched 
with someone in $N_2$ with a common probability $\beta$, 
$0 \le \beta \le 1$. 

Even though, in theory, both $\alpha$ and $\beta$ can vary from 0 to 1, 
in the census context we expect that $\alpha \approx 1$ and $\beta \approx 0$. 

We can also consider instances in which the matching 
error probabilities and capture probabilities potentially 
vary over identifiable population subgroups. In other 
words, the population can be divided into strata, by demo- 
graphic (e.g., age, race, sex) and geographic variables, 
within which the matching error probabilities and capture 
probabilities could be assumed to be more homogeneous 
than in the whole population. Suppose the whole population 
consists of $I$ strata. Let $Z_{[N_i \times 1]}$ be the characteristic 
vector for the population of the $i$-th stratum with unknown 
size $N_i$, and let $Y_{i1}$, $Y_{i2}$ be two samples taken from the $i$-th 
stratum which are used to get an estimate $\hat{N}_i$. Then we 
can form an estimate of the overall population size by 
setting $\hat{N} = \sum_{i=1}^{I} \hat{N}_i$. We can refine models (A) and (B) 
as follows: 


Model (A’): 

Assume model (A) holds within each stratum, and let 
$\alpha_i$ be the probability of a match for matching components 
in stratum $i$, $0 \le \alpha_i \le 1$, $i = 1, \ldots, I$. 


Model (B’): 
Assume model (B) holds within each stratum, and let 
the two probability parameters for the $i$-th stratum be $\alpha_i$, $\beta_i$, 
$i = 1, \ldots, I$. 


For the 1990 PES, the P-sample matching was conducted 
using the sample blocks plus a ring of surrounding blocks 
(Hogan 1993). Geocoding errors may lead to false matches 
across geographically defined post-strata, and false matches 
are possible for demographically defined post-strata. 
Model (B') implicitly assumes that there are no false 
matches across post-strata. Further, all of the models 
represent a simplification of the underlying sample design 
of the PES. 


3. ESTIMATE THE POPULATION TOTAL 


In this section, we consider estimation of the population 
total under the various matching models, (A), (A’), (B), 
and (B’), assuming the validity of usual assumptions of 
independence of the two samples and homogeneous prob- 
abilities of inclusion in the samples. For models involving 
heterogeneous catchability and/or dependence, see the 
three-sample approach in Darroch et al. (1993) and the 
approach in Alho et al. (1993). 

Let $N$ be the number of individuals in the population 
under consideration, $x_{1+}$ the number of individuals in $Y_1$, 
$x_{+1}$ the number of individuals in $Y_2$, and $x_{11}$ the number 
of individuals in both samples. The number of individuals 
observed in $Y_2$ but not $Y_1$ is $x_{21} = x_{+1} - x_{11}$, and the 
number observed in $Y_1$ but not $Y_2$ is $x_{12} = x_{1+} - x_{11}$. 




One can arrange the capture-recapture data in a 2 x 2 
contingency table with one missing cell: 

                           Sample $Y_2$ 
                        present      absent 
Sample $Y_1$  present   $x_{11}$     $x_{12}$ 
              absent    $x_{21}$     - 

where we use the symbol "-" to indicate the missing cell, and 
standard notation for marginal totals: $x_{1+} = x_{11} + x_{12}$, 
$x_{+1} = x_{11} + x_{21}$. There is a corresponding 2 x 2 table of 
probabilities, $p_{ij}$ = Pr[any individual falls into the $(i,j)$ cell], 

                           Sample $Y_2$ 
                        present      absent 
Sample $Y_1$  present   $p_{11}$     $p_{12}$ 
              absent    $p_{21}$     $p_{22}$ 

with the usual linear constraint 

$$\sum_{i=1}^{2} \sum_{j=1}^{2} p_{ij} = 1.$$


Let $n$ be the number of observed different individuals 
in the two samples, i.e., $n = x_{11} + x_{12} + x_{21}$. If we 
assume that the samples are randomly selected with homogeneous 
selection probabilities, then the numbers of individuals 
in the four cells have a multinomial distribution 


$$(x_{11}, x_{12}, x_{21}, N - n) \sim \mathrm{Mult}(N; p_{11}, p_{12}, p_{21}, p_{22}).$$


We use the conditional likelihood approach developed 
by Sanathanan (1972). For fixed $n$, $(x_{11}, x_{12}, x_{21})$ has a 
multinomial distribution with likelihood function 

$$L_1(p_{11}, p_{12}, p_{21}) = \frac{n!}{x_{11}!\, x_{12}!\, x_{21}!} \, \frac{p_{11}^{x_{11}}\, p_{12}^{x_{12}}\, p_{21}^{x_{21}}}{(p_{11} + p_{12} + p_{21})^{n}}. \tag{1}$$

Then $n$ is viewed as being binomially distributed with 
sample size $N$ and probability $p_{11} + p_{12} + p_{21}$, and the 
corresponding likelihood is 

$$L_2(N) = \frac{N!}{n!\,(N - n)!} \, (p_{11} + p_{12} + p_{21})^{n} \, [1 - (p_{11} + p_{12} + p_{21})]^{N - n}. \tag{2}$$

In the conditional approach we derive maximum likelihood 
estimates for the cell probabilities based on the likelihood 
(1), then find the value of N which maximizes (2), given 
the values of the cell probabilities. Sanathanan (1972) has 
shown that under suitable regularity conditions both 
conditional and unconditional likelihood estimates of 
N are consistent and have the same asymptotic multi- 
variate normal distribution. The conditional approach 
is particularly suitable for a large sample problem like 
ours. 

Under the equal catchability assumption, we let $p_1$ be 
the probability that any individual in the population is 
included in $Y_1$, and similarly we let $p_2$ be the probability 
of inclusion in $Y_2$. The probabilities $p_1$ and $p_2$ are usually 
referred to as capture probabilities and they do not depend 
on how the matching mechanism operates. Then the probability 
that an individual is in both samples is $p_1 p_2$, and 
the probability of being in set $N_1$ is $p_1(1 - p_2)$. Since 
model (A) is a special case of model (B) with $\beta = 0$, 
we focus on formulating the problem under model (B). 
To do this, we first need to work out the parametric 
specification of the cell probabilities. An individual will 
fall into the (1,1) cell in the 2 x 2 table only in two cases, 
i.e., the individual is actually in both samples and a match 
is made, or, using the notation in the last section, an individual 
who is actually in $N_1$ is incorrectly matched with 
someone in $N_2$. Here the matching direction from $N_1$ to 
$N_2$ is implicitly assumed in (iii) of model (B). The probability 
that the former case occurs is $\alpha p_1 p_2$, and the probability 
that the latter case occurs is $\beta p_1 (1 - p_2)$. Furthermore, 
the two cases are mutually exclusive. Thus, we have 
$p_{11} = \alpha p_1 p_2 + \beta p_1 (1 - p_2)$, 
$p_{12} = p_1 - p_{11} = p_1 - \alpha p_1 p_2 - \beta p_1 (1 - p_2)$, and 
$p_{21} = p_2 - p_{11} = p_2 - \alpha p_1 p_2 - \beta p_1 (1 - p_2)$. 
Rao (1957) studied regularity 
conditions under which there exist unique maximum 
likelihood estimates of parameters in a multinomial distribution. 
His conditions are satisfied by the parameterization 
of $\{p_{ij}\}$ here. 
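
The cell probabilities under model (B) can be sketched in a few lines of Python. The helper names `cell_probabilities` and `naive_dse_ratio` are hypothetical, and the numerical inputs below are made-up values chosen only to show the direction of the effect; the second helper uses the large-sample approximation that the usual dual-system estimate behaves like N times p1*p2/p11.

```python
def cell_probabilities(p1, p2, alpha, beta):
    """Cell probabilities of the 2 x 2 table under model (B).

    p1, p2 : capture probabilities for samples Y1 and Y2
    alpha  : probability that a true match is still matched
    beta   : probability that a person in N1 is falsely matched into N2
    """
    p11 = alpha * p1 * p2 + beta * p1 * (1.0 - p2)
    p12 = p1 - p11
    p21 = p2 - p11
    p22 = 1.0 - (p11 + p12 + p21)  # from the linear constraint
    return p11, p12, p21, p22


def naive_dse_ratio(p1, p2, alpha, beta):
    """Approximate large-sample ratio of the usual DSE to the true N."""
    p11, _, _, _ = cell_probabilities(p1, p2, alpha, beta)
    return p1 * p2 / p11


# Hypothetical values: near-perfect matching with a few false nonmatches.
print(cell_probabilities(0.90, 0.85, alpha=0.995, beta=0.008))
print(naive_dse_ratio(0.90, 0.85, alpha=0.995, beta=0.008))
```

A ratio above 1 indicates that ignoring the matching error would tend to overstate N for those parameter values; with alpha = 1 and beta = 0 the ratio is exactly 1.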

For $\alpha = 1$, $\beta = 0$, this setup reduces to the usual two 
sample problem and there exist well known solutions in 
closed form for the resulting likelihood equations for the 
conditional likelihood (1) (cf. Bishop et al. 1975, chap. 6, 
p. 232), leading to the usual dual-system estimator, 
$\hat{N}_{DSE} = x_{1+} x_{+1} / x_{11}$. Otherwise, the maximum likelihood 
estimates cannot be written in closed form. Once we have 
$\hat{p}_1$ and $\hat{p}_2$, the conditional maximum likelihood 
estimates for $p_1$ and $p_2$, however, the conditional maximum 
likelihood estimate for $N$ can be written as 

$$\hat{N} = \frac{n}{\hat{p}_1 + \hat{p}_2 - (\alpha - \beta)\hat{p}_1 \hat{p}_2 - \beta \hat{p}_1} \tag{3}$$

(cf. Chapman 1951). Under model (A') or (B'), for the 
$i$-th stratum, one can use the estimates of the parameters 
computed under model (A) or (B) for the data of that 
stratum, and then sum over strata for an estimate of the 
population total. 
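
The estimate in (3) can be illustrated with a short Python sketch. It rests on an assumption that is not stated in the text: alpha and beta are treated as known constants (for instance, plugged in from a rematch study), and the capture probabilities are obtained by equating the model cell probabilities to the observed cell proportions. This is one possible route, not necessarily the numerical procedure used by the authors; the function name `model_b_estimate` is hypothetical.

```python
def model_b_estimate(x11, x12, x21, alpha, beta):
    """Plug-in estimates of p1, p2 and N under model (B).

    Assumes alpha and beta are known and alpha > beta. The capture
    probabilities are chosen so that the model cell probabilities are
    proportional to the observed counts; N then follows from (3).
    """
    if alpha <= beta:
        raise ValueError("this sketch requires alpha > beta")
    x1p = x11 + x12            # number observed in Y1
    xp1 = x11 + x21            # number observed in Y2
    n = x11 + x12 + x21        # distinct individuals observed
    # p11 / p1 = alpha * p2 + beta * (1 - p2) is matched to x11 / x1p.
    p2 = (x11 / x1p - beta) / (alpha - beta)
    # p1 / p2 is matched to x1p / xp1.
    p1 = p2 * x1p / xp1
    denom = p1 + p2 - (alpha - beta) * p1 * p2 - beta * p1
    return p1, p2, n / denom


# Illustrative inputs: 1986 TARO counts (Table 2) with error rates
# computed from the rematch counts in Table 1.
p1_hat, p2_hat, n_hat = model_b_estimate(
    x11=298_204, x12=45_463, x21=38_503,
    alpha=16_623 / (16_623 + 88), beta=18 / (18 + 2_164))
print(p1_hat, p2_hat, n_hat)
```

Setting alpha = 1 and beta = 0 in this sketch recovers the usual dual-system estimator, which gives a simple consistency check.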


4. ESTIMATE MATCHING ERROR RATES BY 
REMATCH STUDY DATA 


In what follows, we give estimates of the matching error 
rate parameters $\alpha$ and $\beta$ using the data from the Matching 
Error Study (rematch study), one of the operations 
conducted by the Census Bureau in the 1986 Los Angeles 
test census to evaluate the PES. Briefly, the rematch 
typically operates for a sample of cases, using more exten- 
sive procedures, highly qualified personnel and reinterviews 
to obtain estimates of the bias associated with the previous 
matching process. For further details, see Childers, 
Diffendal, Hogan and Mulry (1989). In their discussion 
of the Matching Error Study in the Los Angeles TARO, 
Hogan and Wolter (1988) state that "The rematch was 
done independently of the original match, and the discrepancies 
between the match and the rematch results are 
adjudicated. Because of this intensive approach to the 
rematch, we believe the rematch results represent true 
match status, while differences between the match and 
rematch results represent the bias in the original match 
results." 

The data collected in a rematch study can be displayed 
as in the following table: 


Rematch Study Data 

                                  Rematch Classification 
                                  Matched      Not Matched 
Original Match    Matched         $y_{11}$     $y_{12}$ 
Classification    Not Matched     $y_{21}$     $y_{22}$ 


To estimate $\alpha$ and $\beta$, we assume that in the original 
matching process, errors are made according to model (B) 
and that errors in the rematch process can be disregarded, 
i.e., the rematch is assumed to be perfect. It then follows 
that $y_{11} + y_{21}$ is the true number of matches, and thus is 
fixed, while $y_{11}$ is a random variable having a binomial 
distribution, i.e., $y_{11} \sim B(y_{11} + y_{21}, \alpha)$. Thus the maximum 
likelihood estimate of $\alpha$ is $\hat{\alpha} = y_{11}/(y_{11} + y_{21})$, 
and the maximum likelihood estimate of the false nonmatch 
rate $\gamma$ is $\hat{\gamma} = 1 - \hat{\alpha} = y_{21}/(y_{11} + y_{21})$. By the same argument, 
$y_{12} \sim B(y_{12} + y_{22}, \beta)$, and the maximum likelihood 
estimate of the false match rate is $\hat{\beta} = y_{12}/(y_{12} + y_{22})$. 
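
These estimates are simple proportions of the rematch table, as the following Python sketch shows. The function name `rematch_error_rates` is hypothetical; the example counts are the matched and not-matched cells of the P-sample rematch table reported in Section 5.1 (Table 1), with unresolved cases left out.

```python
def rematch_error_rates(y11, y12, y21, y22):
    """Matching error rates from a 2 x 2 rematch table.

    Rows are the original classification, columns the rematch
    classification, and the rematch is treated as error free.
    Returns (alpha_hat, gamma_hat, beta_hat).
    """
    alpha_hat = y11 / (y11 + y21)   # true matches that stayed matched
    gamma_hat = 1.0 - alpha_hat     # false nonmatch rate
    beta_hat = y12 / (y12 + y22)    # false match rate
    return alpha_hat, gamma_hat, beta_hat


# Illustrative counts from the 1986 TARO P-sample rematch study.
alpha_hat, gamma_hat, beta_hat = rematch_error_rates(
    y11=16_623, y12=18, y21=88, y22=2_164)
print(f"{gamma_hat:.2%} {beta_hat:.2%}")
```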

We can use the estimates of the matching error rates 
derived here to analyze the data from the rematch study 
from the Los Angeles test census. Very often, in addition 
to estimating the size of a population, it is of interest to 
estimate the size of a subpopulation such as black, white, 
or a subpopulation at a certain geographical location. In 
such cases, it is more appropriate to allow for heterogeneity 
of matching error rates across various population strata 
by using estimates of matching error rates for each stratum 
of interest. Such estimates can be obtained by conducting 
a rematch study within each stratum and then using the 
derived estimates. Data for applying model (B') are 
available from the 1990 Census and are analyzed here. 


5. APPLICATIONS 


5.1 Application of One Stratum Model to 1986 TARO 


Hogan and Wolter (1988) present the rematch data 
from the 1986 Los Angeles TARO. The rematch results 
for the P-sample are given in Table 1 in the form of a cross-tabulation 
of match statuses as assigned from the original 
TARO match and the rematch. Table 2 presents the two-way 
table of data for the 1986 TARO, with no post-stratification. 
The estimate of the number missed by both 
systems, 5,870, is approximately the same order of magnitude 
as census substitutions, 5,259, and erroneous enumerations, 
6,426 (Hogan and Wolter 1988). Rematch results for 
the E-sample are presented in Table 3. Let CP and EP be the 
total correct enumerations and erroneous enumerations by 
production classification, and let CR and ER be the total 
correct enumerations and erroneous enumerations by 
rematch classification. Based on the data in Table 3, 
Hogan and Wolter (1988) conclude that the original rate 
of erroneous enumerations (EE), EP/(CP + EP) = 
325/(325 + 19,269) = .016, should be increased to about 
ER/(CR + ER) = 411/(411 + 19,334) = .021. 


Table 1 

Results of 1986 Los Angeles Test Census Rematch Study: 
P-Sample. Source: Hogan and Wolter (1988) 

Original Match                  Rematch Classification 
Classification      Matched   Not Matched   Unresolved     Total 
Matched              16,623            18           55    16,696 
Not matched              88         2,164           56     2,308 
Unresolved               17             0          132       149 
Total                16,728         2,182          243    19,153 


Table 2 

Data and Dual-System Estimate for 1986 Los Angeles Test 
Census. Source: Hogan and Wolter (1988) 

Correct Census                     PES 
Enumerations*       Counted     Missed       Total 
Counted             298,204     45,463     343,667 
Missed               38,503      5,870      44,373 
Total               336,707     51,333     388,040 

* Correct Enumerations = Total Census Enumerations - Substitutions - 
Erroneous Enumerations. 




Table 3 

Results of 1986 Los Angeles Test Census Rematch Study: 
E-Sample. Source: Hogan and Wolter (1988) 

Original Match                         Rematch Classification 
Classification            Correct        Erroneous      Unresolved     Total 
                          Enumeration    Enumeration 
Correct enumeration            19,153             28            88    19,269 
Erroneous enumeration              41            283             1       325 
Unresolved                        140            100           223       463 
Total                          19,334            411           312    20,057 


We now reanalyze the data in Table 2 using model (B), 
but ignoring the unresolved cases in Table 1 because their 
classification statuses are unavailable to us. From the data in 
Table 1 we estimate $\hat{\gamma} = 1 - \hat{\alpha}$ = 88/(16,623 + 88) = .53%, 
and $\hat{\beta}$ = 18/(18 + 2,164) = .82%. In Table 4, we present 
the estimates and associated standard deviations under 
model (B) and those from the traditional DSE. The standard 
deviations are computed using asymptotic normality; 
for details, see Ding (1990, 1993a, 1993b). The estimated 
undercount is defined as undercount = 
$(\hat{N} - CEN)/\hat{N} \times 100\%$, where CEN is the total census 
enumerations, i.e., CEN = Correct Census Enumerations + 
Substitutions + EE = 343,667 + 5,259 + 6,426 = 
355,352. The estimates in the last row of Table 4 indicate 
that the undercount estimate provided by the DSE should 
be reduced by 8.42% - 8.05% = .37%. We recall that 
Hogan and Wolter (1988) argue that the original rate of 
EE should be increased by 2.1% - 1.6% = .5% as a 
result of information in the rematch study. This then gives 
an additional adjustment to the estimated undercount of 
about .5%. Overall, we estimate that the undercount 
estimate was biased upward by about .9% (assuming the 
overlap is negligible, even though the two components are 
not strictly additive). 
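
The undercount arithmetic in this paragraph can be checked with a small Python sketch. The function name `undercount_percent` is hypothetical; the inputs are the population estimates and the census total quoted in the text.

```python
def undercount_percent(n_hat, cen):
    """Estimated undercount, (N - CEN) / N, expressed in percent."""
    return (n_hat - cen) / n_hat * 100.0


cen = 343_667 + 5_259 + 6_426            # correct enumerations + substitutions + EE
dse = undercount_percent(388_040, cen)    # traditional dual-system estimate
mod_b = undercount_percent(386_470, cen)  # estimate under model (B)
print(round(dse, 2), round(mod_b, 2), round(dse - mod_b, 2))
```

The difference between the two percentages is the reduction attributed to matching error; the further adjustment for erroneous enumerations discussed above is not part of this calculation.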


Table 4 
Comparison of Estimates for 1986 Los Angeles Test Census 

Parameter          DSE (SD)                   MLE from Model (B) (SD) 
$p_1$              .8856 (5.48 x 10^-4)       .8892 (5.51 x 10^-4) 
$p_2$              .8677 (5.78 x 10^-4)       .8712 (5.86 x 10^-4) 
$N$                388,040 (87)               386,470 (79) 
Undercount (%)     8.42%                      8.05% 


Table 5 
13 Evaluation Post-strata (EPS) for 1990 PES 

 1  Northeast, Central City, Minority 
 2  Northeast, Central City, Nonminority 
 3  U.S., Noncentral City, Minority 
 4  Northeast, Noncentral City, Nonminority 
 5  South, Central City, Minority 
 6  South, Central City, Nonminority 
 7  South, Noncentral City, Nonminority 
 8  Midwest, Central City, Minority 
 9  Midwest, Central City, Nonminority 
10  Midwest, Noncentral City, Nonminority 
11  West, Central City, Minority 
12  West, Central City, Nonminority 
13  West, Noncentral City, Nonminority + Indian 


Table 6 
Dual System Data for 13 EPS of 1990 PES 


EPS $x_{1+}$ (Census) $x_{+1}$ (P-sample) $x_{11}$ 
1 5,966,529 4,656,305.09 4,284,132.78 
2 92395705 8,685,235.79 8,626,362.34 
a8 24,255,611 22,628,349.88 21,068,045.55 
4 31517385378 30,150,266.34 29,966, 142.62 
Sf 9,985,055 8,809,620.02 8,249,407.92 
139 Te 29 13,582,482.34 13,278 ,614.01 
7 47,548,548 44,059,397.93 42,987,517.59 
8* 4,060,286 3,714, 168.27 3,520,314.04 
9 11,826,352 10,058 ,288.52 9,854,052.95 
10 39,343,787 38,358,735.32 38,031,852.01 
Vie 7,283,885 5,743 ,998.39 5,365,961.67 
12 11,073,872 NOIR SS 959 10,222,147.69 
13 26,415,232 26,721,116.28 26,025,370.25 


* Corresponds to minority post-stratum. 


Table 7 
Results of Rematch Study for 13 EPS of 1990 PES: P-Sample 


EPS $y_{11}$ $y_{21}$ $y_{12}$ $y_{22}$ 
i 14,301 124 31 23 
2 15,051 36 16 1,136 
St 28,784 293 49 4,166 
4 BIS: 703 Di 2,058 
ay 28,674 189 18 3,738 
6 Dill Sa, 69 36 1,156 
Vf 48,061 47 20 3,278 
8* 14,800 58 DAN DROZ 
9 16,527 39 20 874 

10 43,721 120 107 1,664 
1 ES22, 133 11 2,097 
12 U5, 122 59 8 1,078 


13 43,356 22, 108 4,583 


Table 8 

Results of Rematch Study for 13 EPS of 1990 PES: E-Sample 

EPS CP EP CR ER 
il 17,027 1,415 17,106 1,645 
2 15,821 879 15,631 932 
37 32,420 2,430 Sh See! 2,446 
4 33,369 1,242 Sy pa) 1,665 
5% 32,412 1,880 33,030 2,044 
6 24,392 575) 24,336 1,284 
7 51,107 2,908 50,929 3,047 
8* 17,174 1,518 17,133 1,526 
9 18,279 648 18,228 656 
10 44,450 1,604 44,584 1,631 
te 13,644 985 13,693 909 
12 15,647 p22 15,590 583 
13 49,647 2,062 49,545 2,334 


5.2 Application of Multiple Strata Model to 1990 Census 


We now analyze stratified data from the evaluation of 
the PES carried out as part of the 1990 decennial census. 
Hogan (1993) describes operations and results for the 1990 
PES, Mulry and Spencer (1991, 1993) present total error 
analysis, and Davis et al. (1991) report on the PES 
Matching Error Study (MES). The MES was conducted 
for each of 13 Evaluation Post-strata (EPS) by geographic 
region and ethnic group. Of the 13 EPS listed in Table 5, 
five correspond to substantial minority populations 
(Blacks and Hispanics), i.e., EPS 1, 3, 5, 8 and 11. In 
Table 6, we present the dual system data for each of the 
13 EPS, and we give, in Table 7 and Table 8, relevant 
rematch data for the P-sample and E-sample. These data 
are drawn from the final reports on PES evaluation 
projects P7 and P10 by the Census Bureau (Davis and 
Biemer 1991a, 1991b). The P-sample for the 1990 PES 
consisted of about 172,000 housing units (Hogan 1992). 
The P-sample data are weighted to get estimates of $x_{+1}$ 
(P-sample total) and $x_{11}$ (total matches) in the usual 
analysis of the dual system data and the analysis presented 
here. Nevertheless, the actual unweighted P-sample data 
can be used to make inference; see the Appendix for a comparison 
between estimates from actual P-sample data and 
estimates from weighted P-sample data. 

In Table 9, we give the usual dual system estimates and
standard deviations of the capture probabilities (i.e., coverage
rate by Census or P-sample) for each of the 13 EPS.
Estimates in Table 10 indicate that there is significant
variation in matching error rates across the EPS. Among
the three EPS with $\hat{\gamma}$ larger than .01%, EPS 3 and EPS 11 are
minority post-strata. This suggests that the nonmatch rate
may be higher for minority post-strata than for the
remainder. On the other hand, there is no clear evidence
from the estimates of $\beta$ that the false match rate is higher
Table 9

Usual Dual System Estimates and Standard Deviations
for 13 EPS of 1990 PES

EPS   p̂1 (SD × 10^-5)    p̂2 (SD × 10^-5)    N̂ (SD)
 1    0.92007 (12.57)    0.71803 (18.42)     6,484,855 (470)
 2    0.99322 (2.78)     0.93402 (8.17)      9,298,737 (67)
 3    0.93105 (5.33)     0.86858 (6.86)     26,051,987 (540)
 4    0.99389 (1.42)     0.96127 (3.46)     31,364,919 (88)
 5    0.93641 (8.22)     0.82618 (11.99)    10,663,134 (390)
 6    0.97763 (4.01)     0.95000 (5.83)     14,297,391 (131)
 7    0.97567 (2.32)     0.90408 (4.27)     48,734,156 (359)
 8    0.94781 (11.54)    0.86701 (16.85)     4,283,875 (190)
 9    0.97969 (4.45)     0.83322 (10.84)    12,071,466 (224)
10    0.99148 (1.48)     0.96665 (2.86)     39,681,946 (108)
11    0.93419 (10.35)    0.73669 (16.32)     7,797,041 (443)
12    0.97240 (5.05)     0.92309 (8.01)     11,388,243 (164)
13    0.97396 (3.08)     0.98524 (2.35)     27,121,400 (104)


Table 10

Estimates of Matching Error Rates
for 13 EPS of 1990 PES

EPS   γ̂ (%)    β̂ (%)
 1    0.009    0.011
 2    0.002    0.014
 3    0.010    0.012
 4    0.021    0.013
 5    0.007    0.005
 6    0.003    0.030
 7    0.001    0.006
 8    0.004    0.008
 9    0.002    0.022
10    0.003    0.060
11    0.011    0.005
12    0.004    0.007
13    0.005    0.023


Table 11

MLEs from Model (B′) and Standard Deviations
for 13 EPS of 1990 PES

EPS   p̂1 (SD × 10^-5)    p̂2 (SD × 10^-5)    N̂ (SD)
 1    0.92406 (12.68)    0.72114 (18.79)     6,456,833 (446)
 2    0.99464 (2.79)     0.93536 (8.30)      9,285,474 (92)
 3    0.93896 (5.38)     0.87597 (7.01)     25,832,352 (279)
 4    0.99999 (2.65)     0.98070 (3.64)     30,731,889 (781)
 5    0.94166 (8.28)     0.83080 (12.13)    10,603,717 (306)
 6    0.97922 (4.03)     0.95154 (6.03)     14,274,182 (64)
 7    0.97600 (2.32)     0.90438 (4.30)     48,717,792 (338)
 8    0.95034 (11.59)    0.86933 (17.06)     4,272,459 (159)
 9    0.97756 (4.47)     0.83141 (11.12)    12,097,806 (285)
10    0.99217 (1.50)     0.96733 (3.06)     39,654,306 (90)
11    0.94239 (10.46)    0.74316 (16.58)     7,729,158 (359)
12    0.97561 (5.07)     0.92614 (8.10)     11,350,674 (101)
13    0.97895 (3.10)     0.99029 (2.42)     26,983,168 (355)
Table 12

Undercount Percentage and Bias Estimates
for 13 EPS of 1990 PES

EPS   UC(DSE)   UC(P)   UC(E)   UC(T)   Bias(P)   Bias(E)   Bias(T)
 1*     6.40     5.99    5.30    4.89     0.41      1.10      1.51
 2     -0.69    -0.83   -1.05   -1.20     0.14      0.36      0.51
 3*     5.59     4.79    5.53    4.72     0.80      0.06      0.87
 4     -0.11    -2.17   -1.33   -3.39     2.06      1.23      3.29
 5*     5.03     4.49    4.68    4.15     0.53      0.35      0.88
 6      1.22     1.06    0.99    0.83     0.16      0.23      0.39
 7      1.77     1.73    1.50    1.47     0.03      0.26      0.29
 8*     3.52     3.26    3.46    3.20     0.26      0.06      0.32
 9      1.05     1.26    1.00    1.22    -0.22      0.05     -0.17
10      0.41     0.34    0.36    0.29     0.07      0.05      0.12
11*     5.26     4.43    5.77    4.94     0.83     -0.51      0.32
12      1.89     1.56    1.51    1.19     0.32      0.38      0.70
13      1.79     1.29    1.28    0.78     0.50      0.51      1.01



for minority post-strata, or the other way around. In Table 11,
we give maximum likelihood estimates and standard deviations
under model (B′). Heterogeneity in the capture probabilities
is significant. This heterogeneity, together with the
variation in the matching error rates, suggests that model
(B′) is more appropriate than model (B). The asymptotic
standard deviations in Tables 9 and 11 appear unusually
small relative to the size of N. Ding (1993b)
shows that this is a typical feature of the dual system
problem when the capture probabilities are very high, as
is the case in the census application. Despite the very narrow
confidence intervals, simulation studies in Ding (1993b)
show that the asymptotic normal approximation being
used is highly accurate in terms of coverage probability.
Table 12 provides estimates of the matching bias from various
sources in the undercount estimate given by the usual DSE.
UC(DSE) is the undercount estimate from the DSE, defined
in the same way as for the 1986 TARO estimate; UC(P)
is the undercount estimate computed by MLE from the
matching error model to adjust for matching bias in the
P-sample, and Bias(P) = UC(DSE) − UC(P). Again,
following Hogan and Wolter (1988), we define the bias in
the E-sample operation by Bias(E) = ER/(CR + ER) −
EP/(CP + EP), and the undercount estimate correcting
for E-sample error by UC(E) = UC(DSE) − Bias(E).
Finally, the total matching bias from both the P-sample and
the E-sample is Bias(T) = Bias(P) + Bias(E), and the undercount
estimate correcting for both sources of error is
UC(T) = UC(DSE) − Bias(T). Note that it is possible,
as observed for EPS 2 and 4 in Table 12, that the undercount
estimate is negative, thus indicating an overcount instead.
This happens when the DSE (or MLE) is less than CEN,
the total census enumeration. The dual system data
represent ‘‘corrected’’ census counts with erroneous and
other incorrect enumerations excluded from CEN.
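
The bias definitions above amount to a few lines of arithmetic. The following Python sketch is illustrative only: the function names are not from the paper, and it assumes that the first count column of Table 8 is the CP term of the Bias(E) formula. The inputs are the EPS 10 row of Table 8 and the EPS 10 values of UC(DSE) and Bias(P) from Table 12.

```python
def bias_e(cp, ep, cr, er):
    """Bias(E) = ER/(CR + ER) - EP/(CP + EP), returned in percent."""
    return 100.0 * (er / (cr + er) - ep / (cp + ep))


def corrected_undercounts(uc_dse, bias_p, bias_e_pct):
    """Return (UC(P), UC(E), UC(T), Bias(T)), all in percent."""
    bias_t = bias_p + bias_e_pct           # Bias(T) = Bias(P) + Bias(E)
    uc_p = uc_dse - bias_p                 # rearranged Bias(P) = UC(DSE) - UC(P)
    uc_e = uc_dse - bias_e_pct             # UC(E) = UC(DSE) - Bias(E)
    uc_t = uc_dse - bias_t                 # UC(T) = UC(DSE) - Bias(T)
    return uc_p, uc_e, uc_t, bias_t


# EPS 10: E-sample rematch counts (Table 8) and DSE results (Table 12).
eps10_bias_e = bias_e(cp=44450, ep=1604, cr=44584, er=1631)
uc_p, uc_e, uc_t, bias_t = corrected_undercounts(0.41, 0.07, eps10_bias_e)
print(round(eps10_bias_e, 2), round(uc_p, 2), round(uc_e, 2),
      round(uc_t, 2), round(bias_t, 2))
```

Rounded to two decimals, these values can be compared with the EPS 10 row of Table 12.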




For each of Bias(P), Bias(E) and Bias(T), a positive
estimate indicates an upward bias in the undercount estimate
from the DSE caused by ignoring the corresponding source
of error; that is, UC(DSE) should be reduced by the
estimated bias to account for that source of error. For each
of UC(DSE), UC(P), UC(E) and UC(T), we get significantly
higher undercount figures for each of the five
minority post-strata, i.e., EPS 1, 3, 5, 8 and 11. For both
Bias(P) and Bias(E), all the bias estimates are positive
except for Bias(P) for post-stratum 9 and Bias(E) for post-stratum
11. This supports the common belief that there
is usually an upward bias attributable to matching errors
in the undercount estimate by the DSE, except for some
non-minority geographical areas where in fact there is a
disproportionately large share of erroneous enumerations.

The effects of the two types of matching errors are well
understood. False nonmatches result in upward bias and
false matches produce downward bias. The nature of the
overall matching bias then depends on which type
of matching error dominates. By computing undercount
estimates for 1980 Census data with selected pairs of $\gamma$ and
$\beta$, Ding (1990) concludes that, owing to the high capture
probabilities in the census application of the capture-recapture
technique, the matching bias is dominated by the false
nonmatch rate when the false nonmatch rate ($\gamma$) and the
false match rate ($\beta$) are of about the same magnitude. This
point can be easily confirmed here. EPS 4 has the largest
estimate of $\gamma$, $\hat{\gamma}$ = .021%, and results in the largest
Bias(P) = 2.06%. EPS 3 and EPS 4 have about the
same estimate of $\beta$, $\hat{\beta}$ = .012% and .013%, respectively,
but EPS 3 has a much smaller Bias(P) = .80%, due to its
smaller estimate of $\gamma$, $\hat{\gamma}$ = .010%. A difference of about
.01% in $\hat{\gamma}$ gives a dramatic difference in Bias(P). For
matches and nonmatches with complete data, Fay et al.
(1988, p. 53) state ‘‘Because of sometimes difficult nature
of the matching work, false nonmatches probably represent
a greater concern than false matches’’. The data
analyzed by our methods include both complete data and
data produced as a result of the Bureau’s imputation
procedure. The sensitivity of our estimates to $\gamma$ lends
some support to the statement by Fay et al. when both
matching for complete data and matching for imputed
data are considered together. On the other hand, a downward
bias can be observed when $\hat{\beta}$ is much larger than $\hat{\gamma}$.
For EPS 9, $\hat{\beta}$ = .022%, about 10 times as large as
$\hat{\gamma}$ = .002%. Thus false matches dominate false nonmatches
for this stratum, and we see the only negative
(downward) bias, Bias(P) = −.22%.

For a specific matching procedure there is an inevitable 
trade-off between matching errors and unresolved cases. 
Depending on the extent of unresolved cases and the 
imputation algorithm used, the resolution process might 
yield a significant number of false matches. The empirical 
evidence accumulated by the Bureau of the Census, as we 
note above, lends some support for the ‘‘unbiasedness’’
of the missing data mechanism used in the imputation
process in our example, but further evidence on the issue
is desirable.


6. SUMMARY 


In this article, we have presented models and methods
for estimating the population total and census undercount
that correct for the matching bias of the usual dual-system
estimate in the presence of matching errors. Two
sources of information are combined in the estimation 
procedure, the dual-system or capture-recapture census 
data, and the data from a matching error study (rematch 
study). The accuracy of our estimates relies on the assump- 
tion that the rematch is error free. Matching error rates 
are likely not to be homogeneous over different population 
strata. Model (B’) allows for heterogeneity of matching 
error rates across various population strata but requires 
stratified rematch data to estimate the error parameters 
within strata. The methods presented here generalize the 
standard theoretical framework for the use of maximum 
likelihood estimation to accommodate matching errors. 

We can adjust for erroneous enumerations in the esti- 
mate of EE by the use of rematch data for the E-sample. 
We obtain an overall matching bias in the DSE by adding 
two bias components from the P-sample and the E-sample. 
Our analysis of the 1986 Los Angeles test census data 
indicates that the upward bias of the DSE in the estimate 
of the census undercount is just under 1%, thereby lending 
support to the 1% value used by Hogan and Wolter (1988) 
in their evaluation study. For the analysis of the 1990 Census
data, the computational results not only agree with the
well-understood aspects of matching bias, but also offer findings that
were not previously known.

For simplicity, we have assumed that the PES is (allowing 
for stratification) based on simple random sampling. The 
models still need to be adapted to account for the complex 
sampling design actually used (see Hogan 1992, 1993). 

It has been known that the perfect matching assumption 
does not hold in the application of dual system estimation 
in the U.S. census. The matching problem in the use of 
the DSE has two components. The first component involves 
the missing P-sample enumeration status. The second 
involves errors in classifying P-sample people as enu- 
merated or not. The present paper provides a method to 
address both components using dual system data adjusted 
for imputed enumeration probabilities, and can be of 
possible value in future censuses provided that the models 
are adapted to handle the complex survey design of the 
PES. Ding (1993c) develops estimates that directly address
the first component by modifying the usual DSE method
and describes the relationship between the proposed estimates
and those that result from the application of the
Census Bureau’s imputation scheme for missing P-sample
enumeration status (Schenker 1988, Belin et al. 1993).


Survey Methodology, December 1994 


ACKNOWLEDGMENTS 


Fienberg’s work was partially supported by a grant 
from the Natural Sciences and Engineering Research 
Council of Canada to York University, Toronto, Canada. 
The authors are grateful to Mary Mulry for furnishing
data on the 1990 Decennial Census, to Joe Sedransk for
suggestions, and to Jay Kadane, Larry Wasserman and 
Mike Meyer for commenting on an earlier version of this 
work. An Associate Editor and two referees provided 
comments that have led to a sharpening of the discussion. 
The basic models in this manuscript were first developed 
as part of the first author’s Ph.D. thesis at Carnegie 
Mellon University. 


APPENDIX 


Comparison of Estimates from Weighted and Unweighted 
P-Sample Data 


For simplicity, we assume a weight $k > 1$ for the
P-sample and consider the usual dual system estimation
problem. Let $\{x_{ij}\}$ be the cell counts in the 2 × 2 table
for weighted P-sample data and census enumerations,
$i, j = 1, 2$ and $ij \neq 22$. One could make inference with
unweighted P-sample data and census enumerations
deflated by a factor of $k$ to get cell counts $\{y_{ij}\}$, $i, j = 1, 2$
and $ij \neq 22$. Then $x_{ij} = k y_{ij}$, $ij \neq 22$, and $x_{+1} = k y_{+1}$,
$x_{1+} = k y_{1+}$. Let the usual dual system estimates derived
from $\{x_{ij}\}$ be $\hat{p}_1$, $\hat{p}_2$ and $\hat{N}_x$, and the estimates from $\{y_{ij}\}$ be
$\hat{q}_1$, $\hat{q}_2$ and $\hat{N}_y$. The estimates are (Bishop et al. 1975,
chap. 6)

$$\hat{p}_1 = x_{11}/x_{+1} = y_{11}/y_{+1} = \hat{q}_1, \qquad \hat{p}_2 = x_{11}/x_{1+} = y_{11}/y_{1+} = \hat{q}_2,$$

$$\hat{N}_x = x_{1+}x_{+1}/x_{11} = k\,y_{1+}y_{+1}/y_{11} = k\hat{N}_y.$$

Thus if one considers the unweighted P-sample data and
uses $\hat{N}^* = k\hat{N}_y$ to estimate the population total, then $\hat{q}_1$,
$\hat{q}_2$ and $\hat{N}^*$ give the same point estimates as $\hat{p}_1$, $\hat{p}_2$ and $\hat{N}_x$
from weighted P-sample data. From the asymptotic
normal distribution of the estimates (Ding 1993b), we
have Var($\hat{N}_x$) = $k$Var($\hat{N}_y$), Var($\hat{q}_1$) = $k$Var($\hat{p}_1$),
Var($\hat{q}_2$) = $k$Var($\hat{p}_2$). Then Var($\hat{N}^*$) = $k$Var($\hat{N}_x$),
and $\hat{q}_1$, $\hat{q}_2$ and $\hat{N}^*$ have larger variance than $\hat{p}_1$, $\hat{p}_2$ and
$\hat{N}_x$, respectively. To compute estimates with unweighted
P-sample data, one needs to know $k$ and $\{y_{ij}\}$. We emphasize
that the trivial case of a constant sampling weight for
all cases in the same post-stratum is assumed here for
simplicity of discussion. However, the real situation can
be complex. For example, Blacks may be sampled at a low
probability in a White stratum and are then combined with
other Blacks sampled with much higher probabilities.
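
The point-estimate identities in this appendix can be checked numerically. The Python sketch below uses hypothetical cell counts and a hypothetical constant weight k; the function name and the values are illustrative assumptions, not data from the PES.

```python
import math


def dual_system_estimates(c11, c1_plus, c_plus1):
    """Usual dual system estimates from the match count and the two margins."""
    p1 = c11 / c_plus1                 # first capture probability
    p2 = c11 / c1_plus                 # second capture probability
    n_hat = c1_plus * c_plus1 / c11    # estimated population total
    return p1, p2, n_hat


k = 4.0                                        # hypothetical constant weight
y11, y1_plus, y_plus1 = 90.0, 100.0, 120.0     # hypothetical deflated counts
x11, x1_plus, x_plus1 = k * y11, k * y1_plus, k * y_plus1

p1, p2, n_x = dual_system_estimates(x11, x1_plus, x_plus1)
q1, q2, n_y = dual_system_estimates(y11, y1_plus, y_plus1)

# The capture probabilities agree, and the totals differ by the factor k.
print(p1 == q1, p2 == q2, math.isclose(n_x, k * n_y))
```

The sketch covers only the point estimates; the variance comparison relies on the asymptotic results cited from Ding (1993b).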
A Hypothesis Test of Linear Regression Coefficients
with Survey Data


PHILLIP S. KOTT¹


ABSTRACT 


This paper discusses testing a single hypothesis about linear regression coefficients based on sample survey data. 
It suggests that when the design-based linearization variance estimator for a regression coefficient is used it should 
be adjusted to reduce its slight model bias and that a Satterthwaite-like estimation of its effective degrees of freedom 
be made. A very important special case of this analysis is its application to domain means. 


KEY WORDS: Design-based; Domain mean; Effective degrees of freedom; Model-dependent; Probability order. 


1. INTRODUCTION 


Most of statistical theory is analytical in nature. One 
begins with a set of data and a fairly general stochastic 
model believed to have generated that data. Statistical 
theory is then invoked to estimate the parameters of the 
model and to determine the accuracy of those estimates. 
Ultimately, the original model may be pared down as the 
result of a series of statistical tests which often take the 
form of investigations into whether particular parameter 
values may be reasonably inferred to be zero. 

The bulk of survey sampling theory, by contrast, is not 
analytical but descriptive. There is a finite population of 
interest. Information about this population can, in prin- 
ciple, be summarized by means of one or more descriptive 
statistics (for example, the population mean and median). 
The survey statistician is constrained by time or budgetary 
considerations to estimate such statistics using only a 
sample of population units. He (she) often faces a two-fold 
problem: first a method of sample selection needs to be 
chosen, then the population statistic(s) needs to be estimated 
from the sample. Although it is possible to construct a 
model-dependent statistical theory for these purposes (see, 
for example, Royall 1970), most survey statisticians invoke 
a model-free approach known as design-based sampling 
theory. In this theory, it is not the sample data values that 
are stochastic (as they are in model-dependent theory) but 
the sample selection process. Rao and Bellhouse (1989)
provide a useful summary of both design-based and
model-dependent theory and of attempts to synthesize the
two approaches.

The main concern here will be in the testing of a single 
hypothesis about linear regression parameters. We will 
assume that the model is correct and that model errors are 
normally distributed with a possibly complex covariance 
structure. Unlike Wu et al. (1988), we will not explicitly
model the error structure (except, perhaps, at a later
stage). Rather, we will focus our attention on a t-statistic
calculated using the linearization variance estimator. That
this variance estimator has desirable robustness properties
from a model-dependent point of view has been demonstrated
by Skinner (1989) and Kott (1991).

This paper will provide methods for reducing the model 
bias of the linearization variance estimator and for deter- 
mining its effective degrees of freedom. A very important 
special case of this analysis is its application to the estimated 
variance of domain means and the difference of such 
means. Since the analysis in this paper is strictly model- 
dependent, the terms ‘‘bias’’ and ‘‘variance’’ will refer to 
model bias and model variance unless otherwise specified. 


2. THE MODEL 


Suppose we have a population of M elements that can
be fit by the linear model:


$$y_M = X_M\beta + \varepsilon_M, \qquad (1)$$


where $y_M$ is an M × 1 vector of population values for the
designated dependent variable;
$X_M$ is an M × K matrix of population values for
the K designated independent variables;
$\beta$ is a K × 1 vector of regression coefficients; and
$\varepsilon_M$ is a normally distributed random vector with
mean $0_M$ and variance $\Sigma_M$.

A random sample, S, of m distinct elements is drawn
from the population. To allow a certain amount of generality
in the sampling design, we assume that the population
is divided into L strata. From each stratum h, $n_h$ distinct
clusters of elements are randomly sampled and denoted
$u_{h1}, u_{h2}, \ldots, u_{hn_h}$. A random sample of $m_{hj}$ elements is
selected from each cluster $u_{hj}$. The clusters are also referred
to as primary sampling units. There are $n = \sum_h n_h$ primary
sampling units in the sample.
Each sampled element has a designation hji, where h
is its stratum, j its primary sampling unit within h, and i
the element itself within hj. Let $p_{hji}$ be the probability that
element hji is in the sample, and let $w_{hji} = m/(Mp_{hji})$ be
the sampling weight of the element. Observe that the
sampling weights have been normalized so that if $p_{hji}$
equals the sampling fraction, m/M, then $w_{hji}$ would be
unity.

The linear model in (1) also applies to the elements in
sample S:


$$y_S = X_S\beta + \varepsilon_S,$$


where $y_S$, for example, is the m × 1 vector of sampled
values for the dependent variable. Let $\varepsilon_{hj} = (\varepsilon_{hj1}, \varepsilon_{hj2},
\ldots, \varepsilon_{hjm_{hj}})'$ be the error vector for the elements in primary
sampling unit hj. Now, $\varepsilon_S$ can be arranged so that the
$\varepsilon_{hj}$ are stacked one on top of the other. Let Var($\varepsilon_{hj}$) =
E($\varepsilon_{hj}\varepsilon_{hj}'$) be denoted by the $m_{hj} \times m_{hj}$ matrix $\Sigma_{hj}$,
which need not be diagonal. We assume that the $\varepsilon_{hj}$ are
uncorrelated across primary sampling units, so that $\Sigma_S$ is
block diagonal.

The design-based estimator for $\beta$ is the weighted least
squares estimator:


$$b_W = (X_S'WX_S)^{-1}X_S'Wy_S,$$


where W is the m × m diagonal matrix of sampling
weights. The g-th diagonal value of W is the sampling
weight associated with the g-th element of the sample.
Clearly, $b_W$ is an unbiased estimator of $\beta$ under the model
in (1).
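
As an illustration of the weighted least squares estimator, the following Python sketch solves the normal equations $(X'WX)b = X'Wy$ by Gaussian elimination. It is a minimal sketch under stated assumptions: the function name, the tiny data set and the weights are hypothetical, and a numerical library would normally be used in place of the hand-written solver.

```python
def weighted_least_squares(X, w, y):
    """Return b = (X'WX)^{-1} X'W y for rows X, weights w and responses y."""
    K = len(X[0])
    # Normal equations A b = c with A = X'WX and c = X'Wy.
    A = [[sum(wi * xi[r] * xi[s] for xi, wi in zip(X, w)) for s in range(K)]
         for r in range(K)]
    c = [sum(wi * xi[r] * yi for xi, wi, yi in zip(X, w, y)) for r in range(K)]
    # Forward elimination with partial pivoting.
    for col in range(K):
        pivot = max(range(col, K), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, K):
            factor = A[r][col] / A[col][col]
            for s in range(col, K):
                A[r][s] -= factor * A[col][s]
            c[r] -= factor * c[col]
    # Back substitution.
    b = [0.0] * K
    for r in reversed(range(K)):
        b[r] = (c[r] - sum(A[r][s] * b[s] for s in range(r + 1, K))) / A[r][r]
    return b


# Hypothetical data: two domain indicators as regressors.
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
y = [2.0, 4.0, 1.0, 2.0, 3.0]
b_equal = weighted_least_squares(X, [1, 1, 1, 1, 1], y)      # plain domain means
b_weighted = weighted_least_squares(X, [1, 3, 1, 1, 1], y)   # weighted domain means
print(b_equal, b_weighted)
```

With indicator regressors the coefficients are the (weighted) domain means, which is the special case taken up later in the paper.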

One can simplify the notation for $b_W$ by letting C be
the K × m matrix $(X_S'WX_S)^{-1}X_S'W$, so that $b_W = Cy_S$.
Let $D_{hj}$ be an m × m diagonal matrix with 1’s corresponding
to the sampled elements of hj and 0’s elsewhere.
Furthermore, let $C_{hj} = CD_{hj}$. Finally, let $r_S = y_S - X_Sb_W$
be the vector of residuals.
The Taylor series or linearization estimator for the
mean squared error of $b_W$ (Shah et al. 1977) is


$$\mathrm{mse} = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} A_{hj}\, r_S r_S'\, A_{hj}', \qquad (2)$$


where $A_{hj} = C_{hj} - n_h^{-1}\sum_g C_{hg}$, and the summation is
over all the primary sampling units in stratum h. The terms
‘‘Taylor series’’ and ‘‘linearization’’ refer to the derivation
of mse using design-based sampling theory. Kott (1991)
shows that mse is a nearly unbiased estimator of the model
variance of $b_W$ under reasonable conditions.

It should be noted that in their derivation of mse, Shah
et al. assumed that the primary sampling units were chosen
with replacement. Here, as in Kott (1991), we are assuming
that the primary sampling units are distinct, which suggests
that they were selected without replacement. The reason
for this discrepancy is that the assurance of independence
among the selected primary sampling units within a stratum
in design-based theory and model-dependent theory has
almost opposite requirements. The discrepancy goes away,
however, if we assume that the primary sampling units
were chosen without replacement but that the goal of
design-based regression theory is not to estimate a finite
population regression parameter but the limit of that
parameter as the population (and the number of primary
sampling units per stratum) grows arbitrarily large. See
Fuller (1975).

If the model in equation (1) holds and L > 1, then
there is an alternative to mse that is also nearly unbiased.
It has the same form as equation (2) except that all n
sampled primary sampling units are treated as if they came
from a single stratum (L = 1). Since the alternative can
be expressed using equation (2), there is no need to treat
it separately in the analysis that follows.


3. A CONVENTIONAL DESIGN-BASED 
t-STATISTIC 


The estimator $b_W$ is a K-vector. In this section we will
be interested in the t-statistic used to test the univariate
hypothesis that $q\beta = Q_0$ for some K element row vector
$q = (q_1, q_2, \ldots, q_K)$. The most common example of
such an hypothesis addresses whether a particular element
of $\beta = (\beta_1, \ldots, \beta_K)'$, say $\beta_k$, is zero. In this example, all
of the $q_j$ would be zero except $q_k$, which would be 1; $Q_0$
would also be zero.

If the model in (1) and the null hypothesis that $q\beta = Q_0$
are true, then


$$\delta = (qb_W - Q_0)/\{q\,\mathrm{Var}(b_W)\,q'\}^{1/2}$$


would be normally distributed with mean 0 and variance 1.
If Var($b_W$) were known, the null hypothesis could be
tested by comparing the statistic $\delta$ to a standard normal
table. Unfortunately, Var($b_W$) must be estimated from
the sample. Conventional design-based practice is to
compare the statistic


$$t = (qb_W - Q_0)/(q\,\mathrm{mse}\,q')^{1/2}, \qquad (3)$$


to a Student’s t distribution with n − L or (n − L − K)
degrees of freedom (see Shah et al. 1977).

The primary goal of this paper is to investigate and
then modify the rather ad hoc practice described above
using the model in equation (1) and our assumption that
$\Sigma_S$ is block-diagonal. This will be done by investigating
$s^2 = q\,\mathrm{mse}\,q'$ as an estimator for $v^2 = q\,\mathrm{Var}(b_W)\,q'$.
First, $s^2$ will be adjusted to reduce its bias; then, a better
determination of the adjusted estimator’s effective degrees
of freedom will be established.
4. THE MODEL BIAS OF s²


The analysis to be conducted is asymptotic. Many of
the results rely on the assumption that n, the number of
primary sampling units in the sample, is large. (Formally,
we should assume that there are infinite sequences of
statistics taking on values as n grows arbitrarily large.) If
n is large, then so too must be M and m, the number of
elements in the population and the sample, respectively.
We will assume that max{$m_{hj}$} is bounded by a finite
value, say $m_0$. Thus, m is bounded by $m_0 n$ and the number
of nonzero elements in the block-diagonal matrix $\Sigma_S$ is
bounded by $m_0^2 n$.

The number of columns of $X_S$, K, is assumed to be
fixed, but we have some flexibility concerning the number
of strata, L. Either L can stay fixed as n grows arbitrarily
large with the $n_h/n$ ratios converging to fixed positive
limits, or L/n can converge to a fixed positive limit with
max{$n_h$} bounded.

Our concern here is with providing sufficient conditions
for the subsequent analysis in the text to hold. The
random variable $\phi$ (formally, the infinite random sequence
$\{\phi_n\}$) will be said to be of probability order $n^{-c}$, i.e.,
$\phi = O_P(n^{-c})$, when $|E(\phi^2)| < B/n^{2c}$ for some finite B.
Similarly, the random matrix $\Phi$ will be said to equal
$O_P(n^{-c})$ when each element $\phi_{ij}$ in $\Phi$ satisfies $|E(\phi_{ij}^2)| <
B/n^{2c}$. When $\phi$ is not random, the P subscript on O is not
needed. The same is true for $\Phi$.

The following assumptions are reasonable given the
structure that has been laid out:


(1) $C = (X'WX)^{-1}X'W$ exists and is O(1/n), and
(2) E(Zy;) = Xp, + O(1/n), where De Hyre. 


Assumption 1 assures us that Var($b_W$) = $C\Sigma_SC'$ =
O(1/n), since there are m elements in the rows of C and
no more than $m_0^2 n$ non-zero elements in $\Sigma_S$.

The variance of $qb_W$ can be rewritten as $v^2 =
\sum\sum v_{hj}/n^2$, where $v_{hj} = n^2 g_{hj}\Sigma_S g_{hj}'$, $g_{hj} = qCD_{hj}$, and
$D_{hj}$ is a diagonal matrix with 1’s corresponding to the
sampled elements of primary sampling unit hj and 0’s
elsewhere. Similarly, $s^2 = q\,\mathrm{mse}\,q'$ can be rewritten as


$$s^2 = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (g_{hj} - \bar{g}_h)\, r_S r_S'\, (g_{hj} - \bar{g}_h)'$$

$$\phantom{s^2} = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} \left[ g_{hj}\hat{\Sigma}_S g_{hj}' - 2\bar{g}_h\hat{\Sigma}_S g_{hj}' + \bar{g}_h\hat{\Sigma}_S\bar{g}_h' \right], \qquad (4)$$


where $\bar{g}_h = \sum g_{hj}/n_h$, the summation is across the j in h,
and $\hat{\Sigma}_S = \sum\sum D_{hj}\, r_S r_S'\, D_{hj}$.

Both $g_{hj}$ and $\bar{g}_h$ are O(1/n) because C = O(1/n) and
$D_{hj}$ has a bounded number of non-zero values. Thus,
$E(g_{hj}\hat{\Sigma}_S g_{hj}') = g_{hj}\Sigma_S g_{hj}' + O(n^{-3})$, $E(\bar{g}_h\hat{\Sigma}_S g_{hj}') =
\bar{g}_h\Sigma_S g_{hj}' + O(n^{-3})$, and $E(\bar{g}_h\hat{\Sigma}_S\bar{g}_h') = \bar{g}_h\Sigma_S\bar{g}_h' + O(n^{-3})$.
Consequently, $E(s^2 - v^2) = O(n^{-2})$.

Since $r_S = (I_m - XC)\varepsilon_S$ and $E(\varepsilon_S\varepsilon_S') = \Sigma_S$,
$E(r_Sr_S') = \Sigma_S - XC\Sigma_S - \Sigma_SC'X' + XC\Sigma_SC'X'$.
From equation (4), we can see that $E(s^2) = v^2 - R$,
where $R = \sum_{h=1}^{L} (n_h/[n_h - 1]) \sum_{j=1}^{n_h} (g_{hj} - \bar{g}_h)Z(g_{hj} - \bar{g}_h)'$
and $Z = 2XC\Sigma_S - XC\Sigma_SC'X'$. Now Z = O(1/n),
because C = O(1/n), X has a fixed number of columns,
and the number of non-zero terms in any column of $\Sigma_S$ is
bounded. This implies $R = O(n^{-2})$. Thus, $-R/v^2$, the
relative bias of $s^2$, is O(1/n).

An alternative estimator for $v^2$ with a reduced relative
bias is


$$s_A^2 = s^2/(1 - \hat{R}/s^2), \qquad (5)$$


where


$$\hat{R} = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (g_{hj} - \bar{g}_h)\,\hat{Z}\,(g_{hj} - \bar{g}_h)',$$


and


$$\hat{Z} = 2XC\hat{\Sigma}_S - XC\hat{\Sigma}_SC'X'.$$

In equation (5), $\hat{R}/s^2$ is used to estimate $R/v^2$. The
variance estimator $s_A^2$ has been proposed here rather than
the more obvious $s^2 + \hat{R}$ as ad hoc compensation for the
slight relative bias of $\hat{R}$ as an estimator of R.


5. THE RELATIVE VARIANCE OF THE 
VARIANCE ESTIMATOR 


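Let $e_{hj} = n g_{hj}\varepsilon_S$ so that Var($e_{hj}$) = $v_{hj}$, and recall that
$v^2 = \sum\sum v_{hj}/n^2$. If $\hat{e}_{hj} = n g_{hj} r_S$, then the random
variable $s^2$ can be re-written as


$$s^2 = n^{-2} \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (\hat{e}_{hj} - \bar{\hat{e}}_h)^2$$

$$\phantom{s^2} = n^{-2} \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (e_{hj} - \bar{e}_h)^2 - \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (g_{hj} - \bar{g}_h)\,A\,(g_{hj} - \bar{g}_h)',$$


where $A = 2XC\varepsilon_S\varepsilon_S' - XC\varepsilon_S\varepsilon_S'C'X'$. It is now possible
to show that


$$s_A^2 = n^{-2} \sum_{h=1}^{L} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (e_{hj} - \bar{e}_h)^2 + O_P(\cdot),$$


where the order of the remainder term is illegible in the source.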
Consider a random variable with a $\chi^2$ distribution with
F degrees of freedom. Its relative variance is 2/F. This
suggests a Satterthwaite-like determination of the effective
degrees of freedom of $s_A^2$ (see Satterthwaite 1946); namely,


$$F = \frac{(nv)^4}{\sum_{h=1}^{L} \left[ \sum_{j=1}^{n_h} v_{hj}^2 + \sum_{j \neq g} v_{hj}v_{hg}/(n_h - 1)^2 \right]}, \qquad (6)$$


which is approximately 2 divided by the relative variance of
$s_A^2$ (since $s_A^2 \approx n^{-2}\sum_h \{\sum_j e_{hj}^2 - \sum_{j \neq g} e_{hj}e_{hg}/(n_h - 1)\}$).

What is being recommended here is that one tests
whether $q\beta = Q_0$ by assuming under the null hypothesis
that


$$t_A = (qb_W - Q_0)/s_A, \qquad (7)$$


has a Student’s t distribution with F degrees of freedom,
where F is determined using equation (6) and making
some assumptions about the $v_{hj}$. Let us call this test the
adjusted t-test.
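
The effective degrees of freedom can be computed directly once values, or relative values, of the $v_{hj}$ are assumed. The Python sketch below is a hedged illustration: it reads equation (6) as F = (ΣΣ v_hj)² divided by Σ_h [Σ_j v_hj² + Σ_{j≠g} v_hj v_hg/(n_h − 1)²], the function name is hypothetical, and every stratum is assumed to contain at least two primary sampling units. Because F is a ratio, the $v_{hj}$ need only be known up to a common constant.

```python
def effective_df(v_by_stratum):
    """Satterthwaite-like effective degrees of freedom from assumed v_hj.

    v_by_stratum holds one list of v_hj values per stratum (n_h >= 2).
    """
    total = sum(sum(v_h) for v_h in v_by_stratum)
    denom = 0.0
    for v_h in v_by_stratum:
        n_h = len(v_h)
        sum_sq = sum(v * v for v in v_h)
        cross = sum(v_h) ** 2 - sum_sq          # sum over j != g of v_hj * v_hg
        denom += sum_sq + cross / (n_h - 1) ** 2
    return total ** 2 / denom


# Single stratum, n = 100 units, a domain of n1 = 10 units.
n1, n2 = 10, 90
# Domain mean: only the units in the domain contribute.
f_domain_mean = effective_df([[1 / n1 ** 2] * n1 + [0.0] * n2])
# Difference of two domain means with a common unit variance.
f_difference = effective_df([[1 / n1 ** 2] * n1 + [1 / n2 ** 2] * n2])
print(round(f_domain_mean, 2), round(f_difference, 1))
```

The two configurations correspond to the domain-mean and difference-of-means cases worked through in the example section of the paper.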


6. A SIMPLE EXAMPLE 


Consider a simple random sample of n units, $n_1$ of
which are in a subset of the sample denoted by A and $n_2$
in the complement denoted $\bar{A}$. Let $y_i$ be the observed
value for unit i. Suppose the following linear model holds:


$$y_i = d_i\beta_1 + (1 - d_i)\beta_2 + \varepsilon_i, \qquad (8)$$


where $d_i = 1$ if unit i is in set A, and 0 if i is in $\bar{A}$; and the
$\varepsilon_i$ are independent normally distributed random variables.


Assuming homoscedastic errors, both the model-dependent
and design-based regression estimators for $\beta_1$ are
the simple domain mean, $\bar{y}_A = \sum_{i \in A} y_i/n_1$. The linearization
estimator for the variance of this estimator is simply
$v_L = (n/[n - 1]) \sum_{i \in A}(y_i - \bar{y}_A)^2/n_1^2$. (It should be
noted that when a domain mean is viewed as an analytic
parameter, its variance requires no finite population
correction; see Fuller 1975.)

This linearization estimator, $v_L$, differs from the model-dependent
variance estimator: $v_M = [\sum_{i \in A}(y_i - \bar{y}_A)^2 +
\sum_{i \in \bar{A}}(y_i - \bar{y}_{\bar{A}})^2]/[n_1(n - 2)]$. The advantage of $v_L$ is
that, unlike $v_M$, it is asymptotically unbiased under the
model even when the $\varepsilon_i$ are heteroscedastic. This point
was noted by Skinner (1989) and Kott (1991). Unfortunately,
there still may be considerable bias for finite n. For
example, when n = 100 and $n_1$ = 10, the relative bias of
$v_L$ is approximately 10%. We can see this by noting that
$v_E = \sum_{i \in A}(y_i - \bar{y}_A)^2/(n_1[n_1 - 1]) = ([n - 1]/n)
(n_1/[n_1 - 1])v_L$ is exactly unbiased.
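
The relationship between the linearization estimator and the exactly unbiased estimator for a domain mean can be verified on any data set. The Python sketch below uses hypothetical domain values; the function names are illustrative, and the formulas are the ones stated in the preceding paragraph.

```python
def linearization_variance(y_domain, n):
    """(n/[n - 1]) * sum of squared deviations in the domain / n1**2."""
    n1 = len(y_domain)
    mean = sum(y_domain) / n1
    ss = sum((yi - mean) ** 2 for yi in y_domain)
    return (n / (n - 1)) * ss / n1 ** 2


def unbiased_variance(y_domain):
    """Sum of squared deviations in the domain / (n1 * (n1 - 1))."""
    n1 = len(y_domain)
    mean = sum(y_domain) / n1
    ss = sum((yi - mean) ** 2 for yi in y_domain)
    return ss / (n1 * (n1 - 1))


n = 100                                             # full sample size
y_domain = [float(value) for value in range(1, 11)] # hypothetical values, n1 = 10
n1 = len(y_domain)

v_lin = linearization_variance(y_domain, n)
v_unb = unbiased_variance(y_domain)
# Relative bias of the linearization estimator under homoscedastic errors.
relative_bias = v_lin / v_unb - 1.0
print(round(relative_bias, 3))
```

The ratio of the two estimators does not depend on the data, so the relative bias shown is a function of n and the domain size only.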


Continuing the example: If one were to calculate a
t-statistic using conventional design-based practice, he
(she) would not only use a biased variance estimator but
would also assume that the statistic has 97 or 99 degrees
of freedom (100 sampling units minus one stratum minus
two regressors, where this last subtraction is not always
performed). Under ideal conditions (homoscedastic errors
within set A), however, the t-statistic calculated using
$v_E$ has a Student’s t distribution with only 9 degrees of
freedom.

Applying equation (5) to the linearization variance
estimator, $v_L$, produces a variance estimator virtually
identical to $v_E$ (since $\hat{R} = [v_L/n_1][1 - n_1/n]$, $s_A^2$ differs
from $v_E$ by only 0.1%). Assuming identically distributed
errors within sets A and $\bar{A}$, calculating the effective degrees
of freedom, F, with equation (6) yields 9.99. This is almost
exactly one degree too many but clearly better than 97
or 99.
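
The 0.1% agreement quoted above can be reproduced with a short calculation. The Python sketch below assumes that the adjustment in equation (5) divides the variance estimate by one minus the estimated relative bias, and that for this example the estimated bias term equals the linearization estimate times (1/n1)(1 − n1/n); the function name and the unit value of the linearization estimate are hypothetical.

```python
def adjusted_variance(s2, r_hat):
    """Bias-adjusted estimator: s2 / (1 - r_hat / s2)."""
    return s2 / (1.0 - r_hat / s2)


n, n1 = 100, 10
v_lin = 1.0                                   # any positive value; only ratios matter
r_hat = (v_lin / n1) * (1.0 - n1 / n)         # estimated bias term for this example
v_adj = adjusted_variance(v_lin, r_hat)
v_unb = ((n - 1) / n) * (n1 / (n1 - 1)) * v_lin   # exactly unbiased estimator

relative_difference = v_adj / v_unb - 1.0
print(round(100 * relative_difference, 2))    # difference in percent
```

The adjusted estimator and the exactly unbiased estimator differ by roughly a tenth of a percent in this configuration.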

A natural hypothesis to test is whether the domain
means, $\beta_1$ and $\beta_2$ in equation (8), are equal. In other
words, is $\beta_1 - \beta_2 = Q_0 = 0$? Assuming that all units
have the same variance, the adjusted t statistic is


nye Viltihes », yi/N2 


ime led icA 


(foe Ry os 


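$s^2 = [n/(n - 1)]\,[\sum_{i \in A}(y_i - \bar{y}_A)^2/n_1^2
+ \sum_{i \in \bar{A}}(y_i - \bar{y}_{\bar{A}})^2/n_2^2]$,
and
$\hat{R} = [n/(n - 1)]\,[(\sum_{i \in A}(y_i - \bar{y}_A)^2/n_1^3)(1 - n_1/n)
+ (\sum_{i \in \bar{A}}(y_i - \bar{y}_{\bar{A}})^2/n_2^3)(1 - n_2/n)]$.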
To calculate the effective degrees of freedom for
$ns^2/(n - 1)$, and thus $t_A$, using equation (6), note that
L = 1, and $v_i \propto 1/n_1^2$ for $i \in A$ while $v_i \propto 1/n_2^2$ for $i \in \bar{A}$.
As a result,


$$F = \frac{(1/n_1 + 1/n_2)^2}{1/n_1^3 + 1/n_2^3 + \left[(1/n_1 + 1/n_2)^2 - 1/n_1^3 - 1/n_2^3\right]/(n - 1)^2},$$


which is 12.3 when $n_1$ = 10 and $n_2$ = 90. The actual
degrees of freedom of $ns^2/(n - 1)$ (i.e., 2 divided by its relative
variance) is reasonably close, 11.1 (the relative variance
of $ns^2/(n - 1)$ is $2[(n_1 - 1)/n_1^4 + (n_2 - 1)/n_2^4]$
divided by $[(n_1 - 1)/n_1^2 + (n_2 - 1)/n_2^2]^2$).
What this synthetic example principally shows is how
misleading conventional design-based practice can be even
with an apparently large sample size. The adjusted t-test
is clearly a giant step in the right direction.

It is tempting to try to avoid making an assumption
about the $v_{hj}$ and to estimate F with


ib, Nh 
(is,)) Yor P2873 
J eT ea 
L Nh 
\B oo Bot Shy Sa Uh, — 1] 
i=1 \j=1 Ai 


where Shi = nN (pj Polic. Although f is a consistent 
estimator of F, its use can produce misleading results as 
we shall see. 

Repeated application of equation (9) on 10,000 simulated
data sets constructed under the assumption that the $\varepsilon_i$
in equation (8) are normal, independent, and identically
distributed yielded an average f value for the variance of
$\bar{y}_A$ of approximately 11.2 with a standard deviation of
about 3.5. In addition to its variability, the average f value
is greater than F. This is due to the denominator of equation
(9) itself being a random variable. It happens that the average
value of 1/f is roughly 0.100 (= 1/9.99), as expected.
Thus, even though the use of f in equation (9) may seem
appealing, it is not recommended.


7. ANOTHER EXAMPLE OF A 
DOMAIN MEAN 


Faced with the simple example of the last section, most 
design-based statisticians would simply treat the units 
sampled from set A as an independent simple random 
sample. The linearization and model-dependent variance 
estimator would then coincide. In practice, however, 
samples often involve clustering, stratification, and unequal 
probabilities of selection. When the domain of interest is 
not a design stratum, it usually becomes impossible to 
separate out the domain’s sampled elements (which need 
not be primary sampling units) and treat them as an 
independent random sample. 

An example of such a complex sample is the 1985
Continuing Survey of Food Intakes by Individuals (CSFII). This
was a stratified, multistage survey of the dietary intakes
of women from 19 to 50 years of age and children from
1 to 5. There were roughly 140 women in the sample who
described themselves as black and 1,150 who described
themselves as white.

Assuming that a dietary intake value for each individual
was independent and identically distributed, values of the
relative bias of the linearization variance estimator
($R/s^2$ from equation (5)) and its effective degrees of
freedom (F from equation (6)) were calculated for the two
race domains. The relative bias for white women was 0.003,
while the effective degrees of freedom were 48.1. For black
women, the relative bias was 0.026, and the effective
degrees of freedom 10.1. Thus, even with a fairly large
sample size, the effective number of degrees of freedom
for black women was relatively small. The conventional
determination of degrees of freedom was around 60
(120 PSUs minus 60 design strata).


8. DISCUSSION 


As pointed out earlier, the use of design-based techniques
can often provide protection when the model in
equation (1) fails. Unfortunately, this protection cannot
be addressed in the strictly model-dependent framework
adopted here. It would be unrealistic, however, to expect
a conventional design-based t-statistic to behave any better
when the model in equation (1) fails than when it holds.

One potential problem of the modified design-based test 
statistic suggested here occurs when the model in equation 
(1) does not fail: it may not be very powerful. Power can 
be lost by estimating regression coefficients with sampling 
weights and by not modelling the error structure directly. 

This loss of power is due to the original design-based
formulation and not to our modification of it. In fact,
$s_c^2$ is a design-consistent estimator of the design mean
squared error of $b_w$ whenever $s^2$ is. This is because $R/s^2$
in equation (5) is also $O_p(1/n)$ from a design-based point
of view, assuming that the first stage of sampling is
conducted with replacement.

Returning to the simple example of Section 6 illustrates
the issue of power forcefully. The model-dependent
and design-based estimates are the same. If all the $\varepsilon_i$ are
assumed to be identically distributed, then the model-dependent
variance estimator, $v_M$, which depends on the
assumption of homoscedasticity, is unbiased and has
98 degrees of freedom. The adjusted design-based variance
estimator is also virtually unbiased, but it has only
9 degrees of freedom.

Often in practice, it will be prudent to sacrifice power 
for robustness. When that is the case, equation (6) provides 
an attractive method of measuring how much power may 
be lost using a modified design-based t-test (equation (7))
when the assumptions of the model are, in fact, correct. 
Furthermore, the equation lends itself to sensitivity 
analyses in which the effects of alternative assumptions 
about the v,; can be evaluated. 


Matrix Masking Methods for Disclosure
Limitation in Microdata


LAWRENCE H. COX¹


ABSTRACT 


The statistical literature contains many methods for disclosure limitation in microdata. However, their use by
statistical agencies and understanding of their properties and effects have been limited. For purposes of furthering
research and use of these methods, and facilitating their evaluation and quality assurance, it would be desirable
to formulate them within a single framework. A framework called matrix masking, based on ordinary matrix
arithmetic, is presented, and explicit matrix mask formulations are given for the principal microdata disclosure
limitation methods in current use. This enables improved understanding and implementation of these methods by
statistical agencies and other practitioners.


KEY WORDS: Statistical confidentiality; Survey data processing; Mathematical methods. 


1. INTRODUCTION 


In this Information Age critical activities of society are 
fuelled by data. Users of statistical data rely especially 
upon government statistical agencies to collect reliable 
data and disseminate it in a timely and broadly useful way. 
Prior to the 1950s, data were released only in printed, 
tabulated form. Beginning in the 1960s, data at the indi- 
vidual respondent level - statistical microdata - began to 
be released by the U.S. Government. 


At present, use of microdata outside statistical agencies 
for research and policy analysis is often curtailed because 
appropriate data are not released to users due to confiden- 
tiality concerns. For three decades statistical agencies have 
wrestled with policy and technical issues in microdata 
release, many of which remain unresolved (Federal 
Committee on Statistical Methodology 1994). The purpose 
of this article is to present a class of matrix transforma- 
tions of microdata intended to help deal with this issue. 


Duncan (1990) and Duncan and Pearson (1991) characterized
several disclosure limitation methods for microdata
(microdata masks) by means of matrix addition and
multiplication, and named such characterizations "matrix
masks." Cox (1991) generalized the concept of matrix
masks, and extended the characterization to other microdata
masks. The characterization of microdata masks as
matrix masks offers conceptual and statistical advantages. 
Matrix masking provides a simple language to represent, 
compare and evaluate microdata masking methods. Matrix 
masking expresses complicated, diverse methods in a 
form presentable to a wide audience including statisticians 
and data users, and offers a standard format to develop 
and optimize the efficiency of transportable microdata 
masking software. 


In this paper, the concept of matrix masks is developed 
in a mathematically rigorous way. Explicit matrix mask 
formulations are provided for the principal microdata 
masking methods in current use, extending those presented 
in Duncan and Pearson (1991) and Cox (1991). This enables 
straightforward implementation of these methods in soft- 
ware, and facilitates closer examination and use of microdata 
masks by statistical agencies. This should lead to improved 
understanding of the properties of microdata masks and 
much needed understanding of their effects on data use. 


2. MATRIX MASKS 


2.1 Definitions 


A microdata file containing p attribute values for each
of n (respondent-level) data records can be represented as
an $n \times p$ matrix X whose entries are denoted $x_{ij}$. Unless
stated otherwise, X contains no missing values. A matrix
mask (A, B, C) is a transformation of X of the form
$\tilde{X} = AXB + C$, with $A, B \neq 0$, involving ordinary
matrix addition and multiplication. As A operates across
the rows of X, A is called a record transforming mask. B
is an attribute transforming mask, and C is a displacing
mask (Duncan and Pearson 1991).

An elementary matrix mask of X is a matrix mask of
the form AX, XB, or X + C. Iterations of (elementary)
matrix masks of X are also matrix masks of X. Therefore,
a matrix mask of X has the form $\tilde{X} = A\hat{X}B + C$, where
either $\hat{X} = X$ or $\hat{X}$ has been obtained from X by application
of a sequence of elementary matrix masks. An important
advantage of this definition is that it enables different
statistical disclosure limitation methods to be applied selectively
to arbitrary subsets of the records and attributes
of X (Section 4).

The matrices A, B, C are not necessarily fixed. For
example, a common mask for numeric attributes involves
addition of random noise (Tendick 1991), so that C is a
random matrix. The matrices A, B, C may depend upon
X. For example, to displace X by additive random noise
proportional to size, draw the $c_{ij}$ randomly from a normal
distribution with mean zero and standard deviation a multiple
of $x_{ij}$, and set $\tilde{X} = X + C$. Or, with $A = X^T$,
$M = AX$ is sufficient for ordinary least squares regression
(Duncan and Pearson 1991).
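
The noise-proportional-to-size displacement described above can be sketched as follows. This is only an illustration under stated assumptions: NumPy is assumed to be available, the function name `proportional_noise_mask` and the multiplier `rel_sd` are hypothetical, and the data are made up.

```python
import numpy as np

def proportional_noise_mask(X, rel_sd, rng):
    """Return (X + C, C), where each c_ij is drawn from a normal distribution
    with mean zero and standard deviation rel_sd * |x_ij| (assumed multiplier)."""
    X = np.asarray(X, dtype=float)
    C = rng.normal(loc=0.0, scale=rel_sd * np.abs(X))
    return X + C, C

rng = np.random.default_rng(seed=12345)
X = np.array([[100.0, 0.0], [2500.0, 40.0], [90000.0, 7.0]])
X_masked, C = proportional_noise_mask(X, rel_sd=0.05, rng=rng)
```

Because the standard deviation scales with the value, a zero entry receives zero noise, and large values are displaced by proportionally larger amounts.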


2.2 Notation 


I denotes the identity matrix. Z denotes the matrix all
of whose entries are zero, and J the matrix of all ones. $U_{ij}$
denotes the matrix all of whose entries equal zero, except
$u_{ij} = 1$. I is always a square matrix; Z, J and $U_{ij}$ need not
be. The $U_{ij}$ matrix, when used as a pre-(post-)multiplier,
retains the values of only one row (column) of the matrix
it multiplies. The dimensions of submatrices may vary
between or within individual formulations and will be
specified for clarity.


3. REPRESENTATIONS OF DATA MASKS AS 
ELEMENTARY MATRIX MASKS 


3.1 Removing and Selecting Microdata 


The most intuitively obvious method for limiting disclosure
is to withhold certain microdata from release to
data users. Typically, these data are associated with the
highest disclosure risk and may require suppressing attributes
(columns) or suppressing records (rows) of X prior
to release.

Attribute suppression of the k-th attribute can be
represented as an attribute transforming mask $\tilde{X} = XB$,
where B is the $p \times (p-1)$ block matrix:

\[
\mathrm{Supp}(k) = \begin{bmatrix} I & Z \\ \multicolumn{2}{c}{Z} \\ Z & I \end{bmatrix},
\]

whose upper I-matrix is of dimension $(k-1) \times (k-1)$,
whose lower I-matrix is of dimension $(p-k) \times (p-k)$,
and whose central Z-matrix is of dimension $1 \times (p-1)$.
An alternative formulation is
$\mathrm{Supp}(k) = \sum_{j<k} U_{jj} + \sum_{j>k} U_{j,j-1}$.

Suppression of several attributes can be represented as
a product of B-matrices of this form. For example,
Supp(k)Supp(j) first suppresses the k-th attribute of X,
and then suppresses the j-th attribute of the resulting
$n \times (p-1)$ dimensional matrix XSupp(k). The dimensions
of Supp(k) and Supp(j) are $p \times (p-1)$ and
$(p-1) \times (p-2)$.
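
A minimal sketch of attribute suppression as post-multiplication, assuming NumPy. The helper name `supp` is hypothetical, indices are 1-based as in the text, and the matrix is built by dropping the k-th column of an identity matrix, which yields the block structure described above.

```python
import numpy as np

def supp(k, p):
    """p x (p-1) suppression matrix Supp(k): identity with column k removed (k is 1-based)."""
    return np.delete(np.eye(p, dtype=int), k - 1, axis=1)

X = np.arange(1, 13).reshape(3, 4)        # 3 records, 4 attributes
one_dropped = X @ supp(2, 4)              # suppress attribute 2
# Chained suppression: the second factor indexes the already reduced matrix.
two_dropped = X @ supp(2, 4) @ supp(1, 3)
```

In the chained call, `supp(1, 3)` removes what was originally attribute 1, because attribute 2 is already gone when it is applied.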


It is sometimes necessary to delete individual records
from X. For example, a respondent may have high identification
risk, or a record may be out of scope or spurious.
Record deletion of the h-th record can be represented as
a record transforming mask $\tilde{X} = AX$, where A is an
$(n-1) \times n$ dimensional block matrix identical in structure
to Supp(h), except: the central Z-matrix of A is of
dimension $(n-1) \times 1$ and the dimensions of the upper
and lower I-matrices of A are $(h-1) \times (h-1)$ and
$(n-h) \times (n-h)$. This A-matrix is denoted Del(h).
An alternative formulation is
$\mathrm{Del}(h) = \sum_{i<h} U_{ii} + \sum_{i>h} U_{i-1,i}$.

Deletion of more than one record is represented as a
product of A-matrices Del(h). For example, to delete the
h-th and i-th records of X, with i > h, use Del(i - 1)Del(h).
For i < h, use Del(i)Del(h). The dimensions of
Del(i - 1) and Del(h) are $(n-2) \times (n-1)$ and
$(n-1) \times n$.

The A-matrix that systematically deletes every h-th record
(for n = rh; r an integer) is a block matrix comprising r
vertical blocks Del(h), each of dimension $(h-1) \times n$.
This generalizes to nonsystematic deletion.

The complement of record deletion is record sampling.
The A-matrix that systematically samples every h-th record
of X, for n = rh, is an $r \times n$ matrix whose q-th row is
the $1 \times n$ dimensional U-matrix $U_{1,qh}$. More generally,
to draw a sample of size s comprising the records of X
indexed by the set $S = \{i_1, \ldots, i_s\}$, use the
A-matrix Sam(X, S) of dimension $s \times n$, each row of
which is a U-matrix $U_{1,i_q}$ of dimension $1 \times n$.
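
Record deletion and record sampling as pre-multiplication can be sketched in the same way. NumPy is assumed; `delete_record` and `sample_records` are hypothetical helper names, and indices are 1-based to match the text.

```python
import numpy as np

def delete_record(h, n):
    """(n-1) x n deletion matrix Del(h): identity with row h removed (h is 1-based)."""
    return np.delete(np.eye(n, dtype=int), h - 1, axis=0)

def sample_records(indices, n):
    """s x n sampling matrix: row q is the unit row vector selecting record indices[q]."""
    return np.eye(n, dtype=int)[[i - 1 for i in indices]]

X = np.arange(1, 11).reshape(5, 2)        # 5 records, 2 attributes
# Delete records h = 2 and i = 4 (i > h) with the product Del(i - 1) Del(h).
kept = delete_record(3, 4) @ delete_record(2, 5) @ X
# Sampling the complementary records gives the same file.
sampled = sample_records([1, 3, 5], 5) @ X
```

The example shows the complementarity noted in the text: deleting records 2 and 4 leaves exactly the records that sampling {1, 3, 5} selects.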


3.2 Aggregating and Grouping Microdata 


The risk of a respondent being identified and confiden- 
tial data disclosed tends to decrease as data are more highly 
aggregated. Attribute aggregation and other microdata 
masks are based on this principle. 

The aggregation mask that replaces the first of two
attributes (the j-th attribute) by the sum of the two attributes,
and deletes the second attribute (the k-th attribute)
from X, for j < k, can be represented as an attribute
transformation $\tilde{X} = XB$, where B is the $p \times (p-1)$
dimensional block matrix:

\[
\mathrm{Agg}(j,k) = \begin{bmatrix} I & Z \\ \multicolumn{2}{c}{U_{1j}} \\ Z & I \end{bmatrix}.
\]

The upper I-matrix of Agg(j,k) is of dimension $(k-1) \times (k-1)$,
the lower I-matrix is of dimension $(p-k) \times (p-k)$,
and the central U-matrix $U_{1j}$ is of dimension
$1 \times (p-1)$. Alternative formulations are

\[
\mathrm{Agg}(j,k) = \mathrm{Supp}(k) + U_{kj}, \quad \text{for } j < k, \text{ and}
\]
\[
\mathrm{Agg}(j,k) = \mathrm{Supp}(k) + U_{k,j-1}, \quad \text{for } j > k.
\]
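
The "suppress, then add one unit entry" reading of the aggregation mask can be checked numerically. This is a sketch assuming NumPy; `agg` is a hypothetical helper and indices are 1-based.

```python
import numpy as np

def agg(j, k, p):
    """p x (p-1) matrix: attribute j becomes x_j + x_k and attribute k is dropped."""
    B = np.delete(np.eye(p, dtype=int), k - 1, axis=1)   # start from Supp(k)
    col = j - 1 if j < k else j - 2                       # 0-based column of attribute j after suppression
    B[k - 1, col] = 1                                     # route attribute k into that column
    return B

X = np.array([[1, 2, 3], [4, 5, 6]])
first_plus_third = X @ agg(1, 3, 3)       # j < k
third_plus_first = X @ agg(3, 1, 3)       # j > k
```

The column shift in the j > k case mirrors the index j - 1 in the second alternative formulation: once attribute k is removed, every later attribute moves one position to the left.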

Aggregation-deletion over more than two attributes can
be represented as a product of B-matrices of this form.
Construct $B_1$ as above to aggregate the first two attributes
to a subtotal, replace the first attribute by the subtotal, and
delete the second attribute. Proceed iteratively, forming
$B_2, B_3, \ldots$ until all summand attributes have been
incorporated into the total and deleted. Then B is the
product of these matrices taken in order, $B = B_1 B_2 \cdots$.


An alternative formulation for aggregation of the j-th
and k-th attributes, replacement of the j-th attribute, and
deletion of the k-th attribute, is given by the B-matrix
product Add(j,k)Supp(k). Aggregation and replacement
of the j-th attribute without deleting the k-th attribute can
be accomplished using the $p \times p$ dimensional B-matrix
$\mathrm{Add}(j,k) = I + U_{kj}$. This generalizes to more summands
v by adding more $U_{vj}$. To create a new totals attribute
(attribute p + 1) from the j-th and k-th attributes without
replacing either attribute, form the $p \times (p+1)$ dimensional
B-matrix $[I \mid U_{j1} + U_{k1}]$, whose I-matrix is of
dimension $p \times p$, and whose right-hand submatrix is of
dimension $p \times 1$. Aggregating another attribute v amounts
to adding an additional $U_{v1}$ to the right-hand submatrix.
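
The two non-deleting variants can be sketched as follows, assuming NumPy. The helper names `add` and `append_total` are hypothetical and indices are 1-based.

```python
import numpy as np

def add(j, k, p):
    """p x p matrix I + U_kj: attribute j becomes x_j + x_k, attribute k is kept."""
    B = np.eye(p, dtype=int)
    B[k - 1, j - 1] += 1
    return B

def append_total(summands, p):
    """p x (p+1) matrix [I | u]: appends a new attribute equal to the sum of the listed attributes."""
    right = np.zeros((p, 1), dtype=int)
    right[[v - 1 for v in summands], 0] = 1
    return np.hstack([np.eye(p, dtype=int), right])

X = np.array([[1, 2, 3], [4, 5, 6]])
replaced = X @ add(1, 3, 3)               # attribute 1 <- attribute 1 + attribute 3
with_total = X @ append_total([1, 3], 3)  # new 4th attribute = attribute 1 + attribute 3
```

Adding a further summand only means passing one more index to `append_total`, which corresponds to adding one more unit entry to the right-hand column.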

Grouping categorical data, sometimes referred to as 
collapsing categories, is representable as attribute aggrega- 
tion. Represent each of the c mutually exclusive categories 
of a categorical variable by a column of X. The absence 
(presence) of the corresponding trait is represented in each 
column by 0 (1). Grouping the c attribute categories to 
form one combined category is simply aggregation across 
the c attributes, replacing one attribute by the aggregate 
and deleting the remaining attributes, using B-matrices in 
the manner described above. 

It is sometimes desirable to aggregate attribute values 
across microrecords. For example, if microrecords can be 
grouped according to some notion of "similarity" (e.g.,
age or profession, or total value of shipments or size of 
work force for businesses in a particular industry), then 
an alternative to releasing high risk microrecords is to 
release a microdata file whose records are microaggregates 
or microaverages of subsets of the original records. 

Record aggregation can be performed in several ways.
A typical case is to replace all summands by the corresponding
totals. Assume that the records to be microaggregated
are arranged consecutively, and denote the respective
sizes of the record groups by $n_1, n_2, \ldots, n_s$, where
$n = n_1 + n_2 + \cdots + n_s$. Microaggregation can be
accomplished using a block diagonal A-matrix of dimension
$n \times n$. The main diagonal of A comprises an
ordered sequence of square J-matrices of dimension $n_v \times n_v$,
$v = 1, \ldots, s$; the remaining entries of A are zero. Under
microaggregation (microaveraging), original values are
replaced by microaggregates (microaverages) in each
record of the aggregation group. Alternatively, in each
group one record may be replaced by the microaggregated
record while the other records are deleted. This may be
accomplished using J-matrices of dimension $1 \times n_v$, in
which case the dimension of A is $s \times n$. To construct
microaverages in lieu of microaggregates, each J-matrix
is replaced by its corresponding $(1/n_v)J$.
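
A sketch of the block diagonal A-matrix for microaggregation and microaveraging, assuming NumPy and records already sorted into consecutive groups. The function name and its `average` and `collapse` flags are hypothetical.

```python
import numpy as np

def microaggregation_matrix(sizes, average=False, collapse=False):
    """Block diagonal matrix of J-blocks, one per record group.

    average=True  scales each block by 1/n_v (microaverages instead of totals).
    collapse=True uses 1 x n_v blocks, giving one output record per group."""
    n = sum(sizes)
    A = np.zeros((len(sizes) if collapse else n, n))
    col = 0
    for v, n_v in enumerate(sizes):
        rows = slice(v, v + 1) if collapse else slice(col, col + n_v)
        A[rows, col:col + n_v] = 1.0 / n_v if average else 1.0
        col += n_v
    return A

X = np.array([[1, 10], [3, 30], [5, 50], [7, 70], [9, 90]], dtype=float)
averaged = microaggregation_matrix([2, 3], average=True) @ X     # n x n case
totals = microaggregation_matrix([2, 3], collapse=True) @ X      # s x n case
```

In the n x n case every record in a group carries the same group average; in the s x n case each group is reduced to a single record of totals.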


3.3 Scrambling Record Order 


A microdata file X being prepared for public use is
typically derived from a larger data file (e.g., by sampling)
or from a more detailed file (e.g., by removal of directly
identifying information such as name, address, and social
security number). The larger file is often maintained in a
prescribed sort order, such as by geography or social
security number, and X is apt to inherit this ordering. To
reduce disclosure risk, the order of the microrecords of X
must be scrambled. Record scrambling can be accomplished
using a stochastic A-matrix. Given a reordering of
the rows (records) of X (i.e., a permutation P of the row
numbers $\{1, \ldots, n\}$), then for $P(i) = h$, set the i-th row
of A equal to the U-matrix $U_{1h}$ of dimension $1 \times n$. A is
denoted Reo(P). An alternative formulation is
$\mathrm{Reo}(P) = \sum_{i=1}^{n} U_{i,P(i)}$.
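
Record scrambling by a permutation matrix can be sketched as below, assuming NumPy. The helper name `reo` is hypothetical, and the permutation is fixed here for reproducibility; in practice it would be drawn at random.

```python
import numpy as np

def reo(perm):
    """n x n reordering matrix: row i has a single 1 in column perm[i] (1-based values)."""
    n = len(perm)
    A = np.zeros((n, n), dtype=int)
    A[np.arange(n), np.asarray(perm) - 1] = 1
    return A

X = np.array([[10], [20], [30], [40]])
P = [3, 1, 4, 2]                 # new record i is old record P(i)
scrambled = reo(P) @ X
```

Every row and every column of the resulting matrix contains exactly one 1, so no record is lost or duplicated.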


3.4 Rounding and Perturbing Microdata 


Data rounding is used by statistical agencies for several 
purposes, including disclosure limitation. Integer variables 
such as age or years worked, or number of children, 
presented exactly, could be used in combination with other 
information to identify respondents (Bethlehem, Keller 
and Pannekoek 1990). Conventional rounding (e.g., base 
5, remainders 0, 1, 2 are rounded down; remainders of 3, 
4 are rounded up), does not preserve additivity to totals, 
and controlled rounding, designed to preserve additivity 
to totals in one and two way tabulations, may be preferred 
(Cox and Ernst 1982). Methods are also available for 
unbiased controlled rounding in one- or two-way tables 
(Cox 1987). 

Data perturbation limits disclosure by introducing 
slight changes to microdata values. Additive perturbation 
amounts to adding appropriate perturbation values to 
original values. Additive perturbation values are often 
drawn randomly from a distribution with mean zero and 
variance small relative to that of the data. Nonrandom 
perturbation is also used. 

Rounding and additive perturbation can be represented
as displacing masks. For each value $x_{ij}$, the displacement
$c_{ij}$ to $x_{ij}$ is computed according to the rounding or perturbation
algorithm, with $c_{ij} = 0$ for those values not subject
to change. Then, $\tilde{X} = X + C$ is the matrix of rounded
(perturbed) values.
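
As an illustration of rounding expressed as a displacing mask, the sketch below computes C for conventional base-5 rounding of non-negative integers (remainders 0, 1, 2 down; 3, 4 up). NumPy is assumed and `rounding_displacement` is a hypothetical name.

```python
import numpy as np

def rounding_displacement(X, base=5):
    """Displacing mask C such that X + C is X conventionally rounded to a multiple of base."""
    X = np.asarray(X)
    r = X % base
    return np.where(2 * r < base, -r, base - r)

X = np.array([[12, 13], [20, 4]])
C = rounding_displacement(X)
X_rounded = X + C
```

Values that are already multiples of the base receive a displacement of zero, in line with the convention that unchanged values have zero entries in C.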


3.5 Attribute Topcoding 


Attribute topcoding is a method by which, given a
predetermined (large) value $T_j$ of the j-th attribute, all
values $x_{ij} > T_j$ are replaced by $T_j$. Given $x_{ij} = f_{ij} T_j + r_{ij}$,
for $f_{ij}$ the integer quotient, and $r_{ij}$ the remainder, $0 \le r_{ij} < T_j$,
compute $t_{ij} = \max\{r_{ij}, (T_j + 1)^{f_{ij}} - 1\} \bmod (T_j + 1)$.
To topcode X, use the displacing mask
$\mathrm{Tco}(X) = (t_{ij} - x_{ij})$.
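
The quotient-and-remainder formula can be checked against a plain "minimum of x and T" rule. The sketch below assumes non-negative integer values and a positive integer topcode, and reads the formula as max{r, (T + 1)^f - 1} mod (T + 1); the function names and the sample ages are illustrative only.

```python
def topcode_value(x, T):
    """Topcoded value of a non-negative integer x with topcode T, via quotient f and remainder r."""
    f, r = divmod(x, T)
    return max(r, (T + 1) ** f - 1) % (T + 1)

def topcode_displacement(column, T):
    """Entries t - x of the displacing mask for one attribute."""
    return [topcode_value(x, T) - x for x in column]

ages = [23, 67, 90, 104, 89]
C = topcode_displacement(ages, T=90)
masked = [x + c for x, c in zip(ages, C)]
```

For x below T the quotient is zero and the remainder is returned unchanged; for x at or above T the power term dominates and reduces to T modulo T + 1.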


4. REPRESENTATIONS OF DATA MASKS 
AS MATRIX MASKS 


4.1 Selecting and Modifying Attribute-Record 
Combinations 


The formulations of the preceding section, based on ele- 
mentary matrix masks, are applied to the entire microdata 
file X, and do not enable selective masking of arbitrary 
subsets of records (rows) and/or attributes (columns) of 
X. The ability to selectively manipulate microdata values 
within subsets of X (i.e., to apply data masks selectively 
to submatrices of X) is important for disclosure limitation 
purposes. This can be accomplished by combining elemen- 
tary matrix masks that enable subset selection along rows 
and columns, or both, in X with elementary matrix masks 
as presented previously. This is accomplished in three 
stages. 

At the first stage, apply the ignoring mask Ign(Q, R) =
AXB, where A is the $n \times n$ dimensional matrix
$A = \sum_{i \in Q} U_{ii}$, and B is the $p \times p$ dimensional
matrix $B = \sum_{j \in R} U_{jj}$.
A leaves the values in the selected rows Q of X unchanged,
and replaces all other values by zeroes; B has similar effect
on the columns R. At the second stage, apply the appropriate
mask or combination of masks M of Section 3 to
Ign(Q, R) to effect the desired changes, yielding
M(Ign(Q, R)). As M is designed to change only the selected
values, then all ignored values, which Ign(Q, R) replaced
by zero, remain zero after applying M. To preserve the
dimensions of X, deletion operations are modified to
replace values to be deleted by zero. Finally, restore the
ignored original values of X by means of

\[
\tilde{X} = M(\mathrm{Ign}(Q, R)) + X - \mathrm{Ign}(Q, R).
\]
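
The three stages can be sketched for one concrete choice of M, namely microaveraging the selected records within the selected attributes. NumPy is assumed; `selector` and `blur` are hypothetical names, and Q and R are 1-based index lists.

```python
import numpy as np

def selector(indices, size):
    """Diagonal 0/1 matrix: the sum of U_ii over the selected (1-based) indices."""
    d = np.zeros(size)
    d[[i - 1 for i in indices]] = 1.0
    return np.diag(d)

def blur(X, Q, R):
    """Average the records in Q over the attributes in R, leaving all other values intact."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    ign = selector(Q, n) @ X @ selector(R, p)     # stage 1: Ign(Q, R)
    rows = [q - 1 for q in Q]
    A_avg = np.eye(n)
    A_avg[np.ix_(rows, rows)] = 1.0 / len(rows)   # M: microaveraging over the rows in Q
    return A_avg @ ign + X - ign                  # stages 2 and 3

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
blurred = blur(X, Q=[1, 3], R=[2])
```

Only the second attribute of records 1 and 3 changes; every other value is restored from X by the final addition and subtraction.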


4.2 Blurring 


When the operation M is microaveraging, the formula- 
tion of Section 4.1 provides a matrix mask for the data 
mask blurring of Strudler, Oh and Scheuren (1986). 


4.3 Data Swapping


Data swapping is a method whereby selected data 
values are exchanged between selected sets of records, in 
a manner that ensures that certain one, two and higher- 
way tabulations remain unchanged (Dalenius and Reiss 
1982). Setting M = Reo(P), where the swapping rule 
is given by a permutation P of the affected records, 
Section 4.1 yields a matrix mask for data swapping. 


5. CONCLUDING COMMENTS 


A formulation based on matrix algebra for representing 
the principal statistical disclosure limitation methods for 
microdata has been developed. Computational issues, 
such as for large files, are not addressed. However, the 
partitioning methods of Section 4.1 could be used to 
reduce effective computational size when working with 
extremely large files. 

Matrix masks offer a comprehensive framework in 
which statistical agencies can develop, evaluate and use 
reliable microdata disclosure limitation software. Such 
software could be shared among agencies. Exploration 
of the uses of matrix masks by U.S. statistical agencies 
has been encouraged by an expert panel (Federal Com- 
mittee on Statistical Methodology 1994, p. 82). The poten- 
tial effect of the widespread use of matrix masks would 
be to standardize the microdata disclosure limitation 
methods available for use by agencies, while expanding 
each agency’s options to evaluate and apply these 
methods. 


Empirical Comparison of Small Area Estimation Methods
for the Italian Labour Force Survey


P.D. FALORSI, S. FALORSI and A. RUSSO¹


ABSTRACT 


The study was undertaken to evaluate some alternative small area estimators to produce level estimates for unplanned
domains from the Italian Labour Force Sample Survey. In our study, the small areas are the Health Service Areas,
which are unplanned sub-regional territorial domains and were not isolated at the time of sample design and thus
cut across boundaries of the design strata. We consider the following estimators: post-stratified ratio, synthetic,
composite expressed as linear combination of synthetic and of post-stratified ratio, and sample size dependent. For
all the estimators considered in this study, the average percent relative biases and the average relative mean square
errors were obtained in a Monte Carlo study in which the sample design was simulated using data from the 1981
Italian Census.


KEY WORDS: Small area estimators; Unplanned domains; Bias; Mean Square Error; Simulation study. 


1. INTRODUCTION 


In Italy, as in many other countries, there is a growing
need for current and reliable data on small areas. This
need affects most sample surveys conducted
by the Italian National Statistical Institute (ISTAT), especially
the Labour Force Survey (LFS), which was
designed to ensure accurate regional estimates.

In the past, ISTAT's solution to this problem was to
broaden the sample without changing the estimation
method (Fabbris et al. 1988). In the last few years, however,
in order to avoid the drawbacks of oversized
samples, research has been launched to identify
estimation methods that improve the accuracy of small area
estimates (Falorsi and Russo 1987, 1989, 1990 and 1991).

In our study, the small areas are the Health Service 
Areas (HSA), which are unplanned sub-regional territorial 
domains and were not isolated at the time of sample design 
and thus cut across the boundaries of the design strata.
The sizes of these territorial domains are such that the 
reliability of regular estimates would have been satisfactory 
had these domains been designed with separate fixed 
sample sizes from individual domains. 

The study was undertaken to evaluate some of the 
alternative small areas estimators to produce HSA level 
estimates from the LFS. 

We consider the following estimators: post-stratified 
ratio, synthetic, composite (expressed as linear combination 
of the synthetic and of the post-stratified ratio), and 
sample size dependent. 

For all the estimators considered in this study, the
average percent relative biases and the average relative
mean square errors were obtained in a Monte Carlo study
in which the LFS design was simulated using data from
the 1981 Italian Census.


2. BRIEF DESCRIPTION OF THE LFS 
SAMPLE STRATEGY 


2.1 Design 


The LFS is based on a two stage sample design stratified 
for the primary sampling units (PSU). The PSUs are the 
municipalities, while the secondary sampling units (SSU) 
are the households. In the framework of each geographical 
region the PSUs are divided according to the provinces. 
In each province the PSUs are divided into two main area 
types: the self-representing area consisting of the larger 
PSUs, and the non self-representing area consisting of the 
smaller PSUs. 

All PSUs in the self-representing area are sampled, 
while the selection of PSUs in the non self-representing 
area is carried out within the strata that have approxi- 
mately equal measures of size. Two sample PSUs are 
selected from each stratum without replacement and with 
probability proportional to size (total number of persons). 
The SSUs are selected without replacement and with equal 
probabilities from the selected PSUs independently. All 
members of each sample household are enumerated. 


2.2 Estimator of Total 


With reference to the generic geographical region, we
introduce the following subscripts: h, for stratum (h = 1,
..., H); i, for primary sampling unit; j, for secondary
sampling unit; g, for age-sex group (g = 1, ..., G).


¹ P.D. Falorsi, Senior Researcher, National Statistical Institute, Rome, Italy; S. Falorsi, Researcher, National Statistical Institute, Rome, Italy;
Aldo Russo, Associate Professor, University of Molise, Campobasso, Italy.
In the present study we consider the following age classes:
14-19, 20-29, 30-59, 60-64, and over 65.

A quantity referring to stratum h, primary sampling
unit i, and secondary sampling unit j will be briefly referred
to as the quantity in hij; and a quantity referring to stratum
h and primary sampling unit i will be referred to as the
quantity in hi.

The following notations are also used: $N_h$, for number
of PSUs in h; $P_h$, for total number of persons in h; $n_h$, for
number of sample PSUs selected in h; $M_{hi}$, for number of
SSUs in hi; $P_{hi}$, for total number of persons in hi; $m_{hi}$,
for number of sample SSUs selected in hi; $P_{ghij}$, for
number of persons in group g belonging to hij; $P_{hij}$, for
number of persons in hij.


Further let

\[
Y = \sum_{h=1}^{H} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} \sum_{g=1}^{G} Y_{ghij}
\]

be the total of the characteristic y for the regional population,
where $Y_{ghij}$ denotes the total of the characteristic of interest
y for the $P_{ghij}$ persons. Currently, the estimate of Y is
obtained by a post-stratified estimator. This estimator is
given by:


where 

In the above formulas, the symbol $K_{hi}$, which denotes the
basic weight, is expressed by:

\[
K_{hi} = \frac{P_h \, M_{hi}}{n_h \, P_{hi} \, m_{hi}}.
\]


3. SMALL AREA ESTIMATORS 


With reference to the generic geographical region, we
suppose that the population P is divided into D non-overlapping
small areas 1, ..., d, ..., D for which estimates
are required. Each area is obtained by an aggregation
of municipalities. The problem considered is the
estimation of the total of a y-variable for all units belonging
to the small area d. In practice, the small area d will have
a non-null intersection with only a certain number of
design strata, which we denote as $H_d = \{h \mid {}_dP_h > 0\}$,
where ${}_dP_h$ represents the part of $P_h$ belonging to the small
area d.

Denoting by ${}_dN_h$ the number of PSUs belonging to
small area d in stratum h, we seek to estimate the small
area total

\[
{}_dY = \sum_{h \in H_d} \sum_{i=1}^{{}_dN_h} \sum_{j=1}^{M_{hi}} \sum_{g=1}^{G} Y_{ghij}.
\]

The development of a particular estimation method for
small areas basically depends on available information. In
Italy the accessible information at small area level is very
poor. At present the accessible territorial information is
total population by sex for each municipality, collected
through register statistics. In the near future (at the end of
1994), the population counts by age-sex group will be
available for each municipality. For this reason, in the
present study we consider only those small area estimators
that utilize, as auxiliary information, the population total
by age-sex group.


3.1 Post-stratified Ratio Estimator 


A post-stratified ratio estimator (POS) of ,Y is given by: 


~ 

in which ${}_dP_{gh}$ denotes the total population for the age/sex
group g in small area d intersected by stratum h, and $\delta_{hi}$ is a
binary variate that equals 1 if the PSU hi belongs to the
small area d and equals 0 otherwise. To clarify formula (1),
note that each PSU lies entirely within one
small area and so is never split between areas.

The post-stratified ratio estimator is unbiased except 
for the effect of ratio estimation bias which is usually 
negligible. The estimator is defined to be zero when there 
is no sample within the domain. This estimator is not 
reliable for small sample sizes. 


3.2 Synthetic Estimator


For computing a synthetic estimator, it is assumed that
the small area population means for given population sub-groups
are approximately equal to the larger area population
means of the same sub-groups. This estimator is
obtained by means of a two-step procedure: (i) with
respect to an aggregated territorial level, estimates of the
investigated features are determined for population sub-groups;
(ii) estimates for the aggregated territorial level
are then scaled in proportion to the sub-group incidence
within the small domain of interest.

The synthetic estimator has a low variance since it is
based on a larger sample, but it suffers from a bias whose
size depends on how far the data depart from the assumption
of homogeneity, for each subgroup, between the small area
and the larger area with reference to the characteristic of
interest, y. The problems associated with synthetic estimators
have been documented by Purcell and Linacre (1976), Gonzalez and
Hoza (1978), Ghangurde and Singh (1978), Schaible (1979)
and Levy (1979) among others.

In this study we consider the following form of synthetic 
estimator (SYN): 

\[
{}_d\hat{Y}_{SYN} = \sum_{g=1}^{G} {}_dP_g \,
\frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} K_{hi} Y_{ghij}}
     {\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} K_{hi} P_{ghij}}. \qquad (2)
\]


3.3 Composite Estimator 


The composite estimator (COM) considered here is 
obtained as a linear combination of the estimators SYN 
(biased with low sample variance) and POS (less biased 
with high sample variance): 


$$ {}_{d}\hat{Y}_{COM} = \alpha \, {}_{d}\hat{Y}_{POS} + (1 - \alpha) \, {}_{d}\hat{Y}_{SYN}, \qquad (3) $$


where $\alpha$ is a constant ($0 < \alpha < 1$). This estimator minimizes the chances of extreme situations (both in terms of bias and sample variance). Therefore, in a given concrete situation such an estimator may turn out to be more advantageous than either of its two components considered separately.

The optimum value for $\alpha$ that minimizes the MSE of the COM estimator is given by


$$ \alpha_{opt} = \frac{MSE({}_{d}\hat{Y}_{SYN}) - E({}_{d}\hat{Y}_{SYN} - {}_{d}Y)({}_{d}\hat{Y}_{POS} - {}_{d}Y)}{MSE({}_{d}\hat{Y}_{SYN}) + MSE({}_{d}\hat{Y}_{POS}) - 2E({}_{d}\hat{Y}_{SYN} - {}_{d}Y)({}_{d}\hat{Y}_{POS} - {}_{d}Y)}. \qquad (4) $$


Furthermore, when neglecting the covariance term in (4), under the assumption that this term will be small relative to $MSE({}_{d}\hat{Y}_{SYN})$ and $MSE({}_{d}\hat{Y}_{POS})$, the optimal weight $\alpha$ can be approximated by


$$ \alpha_{opt} \approx \frac{MSE({}_{d}\hat{Y}_{SYN})}{MSE({}_{d}\hat{Y}_{SYN}) + MSE({}_{d}\hat{Y}_{POS})}. \qquad (5) $$


This is the approach to defining weights followed by Schaible (1978).


In our work the optimal values of $\alpha$ have been obtained from Census data using formula (5). When considering a real sample survey only an estimated value of the optimum $\alpha$ may be used, thus resulting in a decrease in efficiency.
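
As a minimal sketch of formulas (3) and (5), the following computes the approximate optimal weight and the resulting composite estimate. The function names and the numerical values are hypothetical; in practice the two MSEs would themselves have to be estimated.

```python
def approx_optimal_weight(mse_syn, mse_pos):
    """Approximate optimal weight from formula (5), covariance term neglected."""
    return mse_syn / (mse_syn + mse_pos)


def composite_estimate(y_pos, y_syn, alpha):
    """Composite estimate from formula (3)."""
    return alpha * y_pos + (1.0 - alpha) * y_syn


# Hypothetical values for one small area.
alpha = approx_optimal_weight(mse_syn=100.0, mse_pos=300.0)
y_com = composite_estimate(y_pos=1200.0, y_syn=1000.0, alpha=alpha)
```

The weight on POS grows with the MSE of SYN, so the less reliable component is always the one that is down-weighted.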


3.4 Sample Size Dependent Estimator 


The sample size dependent estimator is a particular case of the composite estimator. The linear combination of the synthetic estimator and the less biased estimator is made for each sub-group and depends on the outcome of the given sample. We consider the following form of sample size dependent estimator (SD), which takes into account the realized sample size in the small area. It is defined as (Drew, Singh and Choudhry 1982):


i z ay: Y, 
aX¥sp = y fa, & iP.) ue Gl a4) fat (6) 


g=l 8 


where 


1 REE dee SIE 
a = | ef’) /qKg (7) 


1 otherwise 


with gRoi= Pa Ps 

The constant $F$ is chosen to control the contribution of the synthetic component. The reliance on the synthetic portion decreases as the value of $F$ increases. The choice of the value for $F$ would depend upon several factors. In our study the efficiency of the sample size dependent estimator has been investigated for $F = 1$. This value proved to be efficient while affording protection against the bias of the synthetic estimator.

The logic behind the SD estimator is that when the sample size within domain $d$ and group $g$ is small, the direct estimate for domain $d$ and group $g$ would be unstable and a synthetic estimate may be superior. However, if the sample in domain $d$ and group $g$ is larger than expected this is not a problem, since the performance of the post-stratified direct part improves as the sample size increases. In conclusion, we observe that the SD estimator may be considered as a particular form of the sample size dependent regression estimator given in Särndal and Hidiroglou (1989), which has good conditional properties.
4. DESCRIPTION OF THE EMPIRICAL STUDY


4.1 Simulation of the LFS Sample Design 


In our study, we have considered the 14 HSAs of the 
Friuli region as small areas. The variable of interest, y, is 
the number of unemployed. 

Evaluation of the performance of the various estimators, 
discussed in Section 3, was done by referring to a sample 
design (two stages with stratification of the PSUs) identical 
to that adopted for the LFS in Friuli. This design is based 
on the selection of 39 PSUs and 2,290 SSUs from a popula- 
tion of 219 PSUs and 465,000 SSUs. 

We have independently selected 400 Monte Carlo sample replicates, each identical in size (in terms of PSUs and of SSUs) to the LFS sample. All the information utilized in the simulation is taken from the 1981 General Population Census, so ${}_{d}Y$ is known.


4.2 Evaluation of Small Area Estimators 


We denote by ${}_{d}\hat{Y}(mr)$ the estimate of the total ${}_{d}Y$ for the small area $d$ from the $r$th Monte Carlo replicate when using the estimator $m$. The percent relative bias of estimator $m$ for the small area $d$ is given by

$$ {}_{d}ARB_m = \left( \frac{1}{R} \sum_{r=1}^{R} \frac{{}_{d}\hat{Y}(mr)}{{}_{d}Y} - 1 \right) 100, $$

where $R$ is the number of samples ($R = 400$).

The average of the percent absolute relative bias of estimator $m$ over the whole set of small areas is:

$$ \overline{ARB}_m = \frac{1}{D} \sum_{d=1}^{D} \left| {}_{d}ARB_m \right|, $$

where $D$ is the number of small areas under observation ($D = 14$).

The percent root mean square error of estimator $m$ for small area $d$ is

$$ {}_{d}RMSE_m = \frac{\sqrt{{}_{d}MSE_m}}{{}_{d}Y} \, 100, $$

where the mean square error of estimator $m$ for the small area $d$ is expressed by

$$ {}_{d}MSE_m = \frac{1}{R} \sum_{r=1}^{R} \left( {}_{d}\hat{Y}(mr) - {}_{d}Y \right)^2. $$

The average percent root mean square error of estimator $m$ over all areas is

$$ \overline{RMSE}_m = \frac{1}{D} \sum_{d=1}^{D} {}_{d}RMSE_m. $$

Empirical Comparison of Small Area Estimation Methods 


4.3 Analysis of Results 


A. Overall Performance Measures 


The average percent absolute biases and the average percent root mean square errors of the small area estimators for the LFS characteristic "number of unemployed persons" are presented in Table 1. Looking at this table, the following conclusions emerge:


(i) As expected, POS presents the smallest bias. The bias of SYN is larger than the bias of the other estimators. The bias of COM is roughly 30% lower than the bias of the SYN estimator. The bias of the SD estimator is only slightly higher than that of the POS estimator.


(ii) SYN and COM have the smallest average percent root mean square errors, but these estimators are affected by a very high bias. POS, with low bias, is, conversely, the least efficient estimator. The average percent root mean square error of SD is approximately 30% higher than those of the SYN and COM estimators.


Table 1


Average Percent Absolute Relative Bias (ARB)
and Average Percent Root Mean Square Error (RMSE)
for Unemployed by Estimator


Estimator     ARB      RMSE
POS           1.74     42.08
SYN           8.97     23.80
COM           6.00     23.57
SD            2.39     31.08


B. Performance Measures by Small Area 


Tables 2 and 3 present the Percent Relative Bias (${}_{d}ARB$) and the Percent Root Mean Square Error (${}_{d}RMSE$) of the estimators for each of the fourteen Health Service Areas in Friuli. Furthermore, Table 2 gives the percent ratio between the population of the HSA and the population of the set H of strata including the HSA ($p_1$); Table 3 shows the percent ratio between the population of the HSA and the population of the region Friuli ($p_2$) and the percent ratio between the population of the set H of strata including the HSA and the population of the region Friuli ($p_3$). Looking at these tables, the following conclusions emerge:


(i) SYN and COM are badly biased in some small areas, namely, in those small areas where the model underlying SYN fits poorly. Generally the small areas with low values of the ratio $p_1$ are affected by large bias (e.g., HSAs 1, 2, 3, 4 and 6). Conversely, large values of the ratio $p_1$ are associated with low values of the bias (e.g., HSAs 5, 9, 10 and 13). However, SYN and COM consistently have an attractively low RMSE compared to the other alternatives. In three of the fourteen areas (viz., areas 3, 4 and 8) COM is consistently the most efficient estimator. In two areas (10 and 12) SYN is evidently more efficient, and in the remaining areas the two estimators are roughly similar from the point of view of efficiency. Furthermore, we observe that the lowest values of RMSE for SYN generally are associated with the highest values of the ratio $p_3$ (e.g., HSAs 1, 2, 5, 6, 9 and 13). HSAs 3 and 4, while having a high value of the ratio $p_3$, present a high value of RMSE. This is due to the large bias.


(ii) POS shows negligible bias values in almost all small areas. The RMSE values of POS are much higher than those of the other estimators in all the small areas. We observe that the RMSE of the POS estimator is negatively correlated with the ratio $p_2$. This is because the expected sample size increases as the ratio $p_2$ increases. Consequently, the variance (which is the main component of the MSE of POS) decreases.


(iii) The estimator SD presents a negligible bias in seven (5, 7, 9, 10, 11, 12 and 13) of the fourteen small areas. In the other areas the bias is quite low. Furthermore, in nine areas (2, 3, 4, 5, 9, 10, 11, 12 and 13) SD has a bias similar to that of POS. The estimator SD is better than POS from the MSE point of view. In four areas (7, 8, 9 and 13) its RMSE is similar to those of SYN and COM.


(iv) Finally, we notice that in the largest areas, with the highest values of the ratio $p_2$ (e.g., HSAs 9 and 5), all the estimators considered give similar results in terms of bias and MSE. For the remaining areas, where the estimators perform differently, there is a problem in the choice of the best estimator.


Table 2


Percent Relative Bias (${}_{d}ARB$) of Each of Fourteen Health
Service Areas (HSA) in Friuli for Unemployed by Estimator


                       Estimator

HSA    p1       POS      SYN      COM      SD
 1     19.1    -1.57   -10.92    -7.68    -3.01
 2     16.1    -5.61    -9.21    -6.97    -4.79
 3     S38     -5.21    28.82    17.98     S19
 4     16.3    -2.50    20.92    15.02     2.99
 5     ATA     -0.46     1.61     0.98    -0.28
 6     24.6    -1.37   -12.24    -9.06    -3.28
 7     81.8     0.05    -6.25    -3.40    -1.66
 8     WOsi     0.81    11.80     6.63     DMF
 9     92.2     0.47     0.76     0.68     0.78
10     Pe?      0.36    -1.34     0.51    -1.02
11     Deal    -1.01    -5.64    -5.00    -1.62
12     40.6    -1.52    -6.66    -6.05    -1.19
13     56.3    -0.95    -3.12    -1.11    -1.28
14     21.8    -2.51    -6.21    -3.03    -3.53

p1 = percent ratio between the population of the HSA and the
population of the set H of strata including the HSA.
Table 3


Percent Root Mean Square Error (${}_{d}RMSE$) of Each of
Fourteen Health Service Areas (HSA) in Friuli
for Unemployed by Estimator


                                Estimator

HSA    p2       p3       POS      SYN      COM      SD
 1     3.8      1)       SVB.     20.41    Zales    32.59
 2     3)aIh    19       63.36    19.45    20.81    38.30
 3     3.6      28.2     57.44    SO       30.71    42.46
 4     3.8      DEY      58.19    30.09    27.02    36.88
 5     20.2     42.9     18.81    13.38    14.01    17.87
 6     8.5      34.8     28.09    17.49    17.00    22.69
 7     6.9      8.4      23.83    21.47    DME GI   22.67
 8     4.8      6.8      28.75    28.54    26.35    27.40
 9     DM       22.9     W229     16.15    16.40    16.89
10     1.8      2S)      67.00    SO       Soil     S27
11     BP)      14.6     49.82    18.35    19.20    30.42
12     4.3      10.7     46.40    22.10    24.04    ahi
13     12.6     22.4     20.13    15.53    15.40    17.88
14     2.3      10.1     57.80    Doras    DOA      36.81
p2 = percent ratio between the population of the HSA and the
population of the region Friuli.
p3 = percent ratio between the population of the set H of strata
including the HSA and the population of the region Friuli.


5. CONCLUSIONS 


From the point of view of bias, the post-stratified ratio 
estimator (POS) is essentially unbiased in almost all the 
small areas. Furthermore the sample size dependent esti- 
mator (SD) has negligible values of the bias in almost all 
small areas. Synthetic (SYN) and composite (COM) esti- 
mators present bias values much higher than those of the 
other estimators. 

From the point of view of efficiency, SYN and COM 
consistently have significantly lower RMSE compared to 
other alternatives. The estimator SD is much more efficient 
than POS and furthermore in four of the fourteen areas 
it shows RMSE values close to those of SYN and COM. 
Further, when considering the estimator COM there is the problem of computing the optimum $\alpha$. In practice only an estimated value of $\alpha$ may be used, resulting in a decrease in the efficiency of this estimator. Thus, considering both bias and efficiency, the SD estimator would seem to be preferable to the other estimators examined in the context of the LFS in Friuli. The sampling rates in Friuli are relatively high, and the magnitudes of the relative biases and efficiencies of these estimators may be different in other regions where the sampling rates are low, e.g., Piemonte and Lombardia.
Nonparametric Estimation of Response Probabilities
in Sampling Theory 


THEOPHILE NIYONSENGA¹


ABSTRACT 


We deal with the nonresponse problem by drawing on the model of selection in phases that was proposed by Sarndal 
and Swenson (1987). To estimate response probabilities, we use the nonparametric approach first advanced by 
Giommi (1987). We define estimators according to the nonparametric estimation (NPE) model, and we study their 
general properties empirically. Inference is based on the concept of quasi-randomization (Oh and Scheuren 1983). 
The emphasis is on estimating the variance and constructing confidence intervals. We find, by way of a Monte Carlo 
study, that it is possible to improve the quality of the estimators considered by using a variant of the NPE approach. 
The latter also serves to confirm the performance of regression estimators in terms of variance estimation. 


KEY WORDS: Weighting by phases; Regression estimator; Variance estimators. 


1. INTRODUCTION 


To counter the effect of nonresponse on the estimation 
of parameters of a finite population, we consider the 
phenomenon of nonresponse as a unit selection process 
in three phases. We therefore use weighting by phases. 
This adjustment procedure assigns to each unit observed 
a weight that is inversely proportional to the probability 
of appearing in the sample, to the unit response probability 
given the sample, and to the item response probability 
given the sample and the set of respondents per unit. 

In practice, only the probabilities of inclusion in the 
sample are known. The problem facing us is to estimate 
individual response probabilities before incorporating 
them in formulas for the estimators of interest. The non- 
parametric estimation approach is one of the response 
probability estimation procedures. It is motivated by the 
use of auxiliary variables which are linked with unit and 
item response mechanisms (Giommi 1985, 1987), and 
which may be correlated with the variables of interest. This 
avoids assuming that nonresponse is independent of the 
variables being studied (Oh and Scheuren 1983). This 
approach also enables us to avoid postulating one or more 
parametric models governing response, such as the Logit 
and Tobit models (Grosbras 1987b; Chicoineau, Payen 
and Thélot 1985) or models of uniform response within 
subpopulations (Oh and Scheuren 1983; Sarndal and 
Swenson 1985, 1987). 

In the Monte Carlo study illustrating certain estimators 
according to the nonparametric approach, we consider the 
quite specific case in which the two response mechanisms 
are governed by the same auxiliary variables. The differ- 
ence between items will reside in the degree of correlation 
between each item and the auxiliary variables. 


2. NONRESPONSE: A THREE-PHASE 
SELECTION PROCESS 


Consider a finite population $U = \{1, 2, \ldots, k, \ldots, N\}$ of size $N$. Let $s$ be a sample of fixed size $n$ drawn from $U$ according to a design $p(s)$, known and characterized by inclusion probabilities $\pi_k > 0$, $\forall k$, and $\pi_{k\ell} > 0$, $\forall k \neq \ell$. We want to observe the units $k \in s$ in relation to a set of $Q$ items $y_1, \ldots, y_q, \ldots, y_Q$ $(Q \geq 1)$, and then estimate the total per item $t_q = \sum_U y_{qk}$, for every $q$ $(q = 1, \ldots, Q)$. We assume that, conditional on $s$, each unit $k$ has a probability $\varphi_k > 0$ of participating in the survey and that the probability that two units $k$ and $\ell$ participate is $\varphi_{k\ell} > 0$, with $\varphi_{kk} = \varphi_k$. We denote the set of units that agree to participate in the survey by $r$ and the mechanism by which the set $r$ was obtained by $p(r \mid s)$. We further assume that, conditional on $s$ and $r$, each unit $k \in r$ responds to item $y_q$ with probability $\psi_{qk} > 0$ and that the probability that two units $k$ and $\ell \in r$ respond to item $y_q$ is $\psi_{qk\ell} > 0$, with $\psi_{qkk} = \psi_{qk}$. We denote by $r_q$ the set of units that, having agreed to participate in the survey, respond to item $y_q$, and by $p(r_q \mid s, r)$ the mechanism by which the set $r_q$ is obtained, for all $q$ $(q = 1, \ldots, Q)$.

The sets $s$, $r$ and $r_q$ are obtained from three selection phases for which only the probabilities of inclusion in $s$ are known. The composition of the unit selection mechanisms gives rise to probability outputs that we denote by $\pi_k\theta_{qk}$, where $\theta_{qk} = \varphi_k\psi_{qk}$ and $\theta_{qk\ell} = \varphi_{k\ell}\psi_{qk\ell}$ with $\theta_{qkk} = \theta_{qk}$, which do not correspond to inclusion probabilities. Nor does the quantity $\theta_{qk}$ correspond to an inclusion probability for the two response phases conditional on $s$. If we define the probabilities of inclusion in $r_q$ by $\pi^{*}_{qk} = \mathbb{P}(k \in r_q)$ and the probabilities of inclusion in $r_q$ given $s$ by $\theta^{*}_{qk} = \mathbb{P}(k \in r_q \mid s)$, then (i) $\pi^{*}_{qk} \neq \pi_k\theta^{*}_{qk}$ and (ii) $\pi_k\theta^{*}_{qk} \neq \pi_k\theta_{qk}$. Furthermore, (iii) $\theta^{*}_{qk} = \theta_{qk}$ if the probabilities $\psi_{qk}$ are independent of $r$, and (iv) $\pi^{*}_{qk} = \pi_k\theta_{qk}$ if the $\varphi_k$ do not depend on $s$ and if the $\psi_{qk}$ do not depend on either $r$ or $s$.


¹ Théophile Niyonsenga, Ph.D., Researcher, Centre de Recherche Clinique, Centre Hospitalier Universitaire de Sherbrooke, Sherbrooke, QC, Canada,


3. A FEW SPECIAL ESTIMATORS 


Assume that there is an auxiliary variable $x_q$ (for the $q$-th item) strongly correlated with the variable $y_q$ and such that $x_{qk}$ is known $\forall k \in s$ or $\forall k \in U$. We take the specific case in which $x_q = x$, $\forall q$ $(q = 1, \ldots, Q)$, and we assume the following linear model $\xi$:


$$ E_{\xi}(y_{qk} \mid x_k) = \beta_q x_k, \qquad Cov_{\xi}(y_{qk}, y_{q\ell} \mid x_k, x_\ell) = \begin{cases} \sigma_q^2 x_k & \text{if } k = \ell \\ 0 & \text{otherwise} \end{cases} \qquad (3.1) $$


in which $\beta_q$ and $\sigma_q$ are unknown parameters. The following results are extensions of the findings of Särndal and Swenson (1987).


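Result 1. If $x_k$ is known, $\forall k \in s$, then the regression estimator, denoted by $\hat{t}_{Reg}$ and defined by:


$$ \hat{t}_{Reg} = \left( \frac{\sum_{r_q} y_{qk}/(\pi_k\theta_{qk})}{\sum_{r_q} x_k/(\pi_k\theta_{qk})} \right) \sum_{s} \frac{x_k}{\pi_k}, \qquad (3.2) $$


is approximately unbiased for $t_q$. Its approximate variance is a sum of three components $V_1$, $V_2$ and $V_3$ representing the respective portions of the variance due to the selection phases, that is:


$$ V_1 = \sum\sum_{U} \Delta_{\pi k\ell} \, (y_{qk}/\pi_k)(y_{q\ell}/\pi_\ell), $$

$$ V_2 = \mathbb{E}_{s}\Big\{ \sum\sum_{s} \Delta_{\varphi k\ell} \, (E_{qk}/\pi_k\varphi_k)(E_{q\ell}/\pi_\ell\varphi_\ell) \Big\}, $$

$$ V_3 = \mathbb{E}_{s}\mathbb{E}_{r}\Big\{ \sum\sum_{r} \Delta_{\psi qk\ell} \, (E_{qk}/\pi_k\theta_{qk})(E_{q\ell}/\pi_\ell\theta_{q\ell}) \Big\}, $$


where the $E_{qk}$ are theoretical residuals of model (3.1). An estimator of $V(\hat{t}_{Reg})$ is given by $\hat{V}(\hat{t}_{Reg}) = \hat{V}_1 + \hat{V}_2^{*}$ (where $V_2^{*} = V_2 + V_3$) with:


$$ \hat{V}_1 = \sum\sum_{r_q} \frac{\Delta_{\pi k\ell}}{\pi_{k\ell}\theta_{qk\ell}} \left( \frac{y_{qk}}{\pi_k} \right) \left( \frac{y_{q\ell}}{\pi_\ell} \right) \qquad (3.3) $$

and

$$ \hat{V}_2^{*} = \sum\sum_{r_q} \frac{\Delta_{\theta qk\ell}}{\theta_{qk\ell}} \left( \frac{e_{qk}}{\pi_k\theta_{qk}} \right) \left( \frac{e_{q\ell}}{\pi_\ell\theta_{q\ell}} \right), \qquad (3.4) $$


where $\Delta_{\pi k\ell} = \pi_{k\ell} - \pi_k\pi_\ell$, $\Delta_{\varphi k\ell} = \varphi_{k\ell} - \varphi_k\varphi_\ell$, $\Delta_{\psi qk\ell} = \psi_{qk\ell} - \psi_{qk}\psi_{q\ell}$ and $\Delta_{\theta qk\ell} = \theta_{qk\ell} - \theta_{qk}\theta_{q\ell}$, the $e_{qk}$ being the observed residuals obtained from model (3.1).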


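Result 2. If $x_k$ is known, $\forall k \in U$, then the regression estimator, denoted by $\hat{t}_{Reg1}$ and defined by:


$$ \hat{t}_{Reg1} = \left( \frac{\sum_{r_q} y_{qk}/(\pi_k\theta_{qk})}{\sum_{r_q} x_k/(\pi_k\theta_{qk})} \right) \sum_{U} x_k, \qquad (3.5) $$


is approximately unbiased for $t_q$. Its approximate variance is also a sum of three components $V_1$, $V_2$ and $V_3$. The expression of $V_1(\hat{t}_{Reg1})$ differs from that of $V_1(\hat{t}_{Reg})$ by the use of the theoretical residuals $E_{qk}$ in place of the raw values $y_{qk}$, whereas the expressions of $V_2$ and $V_3$ are identical to those defined above for $\hat{t}_{Reg}$. An estimator of $V(\hat{t}_{Reg1})$ is given by $\hat{V}(\hat{t}_{Reg1}) = \hat{V}_1 + \hat{V}_2^{*}$ where:


$$ \hat{V}_1 = \sum\sum_{r_q} \frac{\Delta_{\pi k\ell}}{\pi_{k\ell}\theta_{qk\ell}} \left( \frac{e_{qk}}{\pi_k} \right) \left( \frac{e_{q\ell}}{\pi_\ell} \right), \qquad (3.6) $$


and where $\hat{V}_2^{*}$, estimating $V_2^{*} = V_2 + V_3$, is obtained by formula (3.4).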


Comment 1. If $x_k = 1$, $\forall k \in U$, formula (3.5) defines an estimator, denoted by $\hat{t}_{Exp}$, where:


$$ \hat{t}_{Exp} = N \, \frac{\sum_{r_q} y_{qk}/(\pi_k\theta_{qk})}{\sum_{r_q} 1/(\pi_k\theta_{qk})}. \qquad (3.7) $$


The estimator $\hat{t}_{Exp}$ is called an "expansion estimator". An approximately unbiased variance estimator for $V(\hat{t}_{Exp})$ is derived from formulas (3.4) and (3.6).
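
As a minimal sketch, the expansion estimator can be computed as a weighted mean scaled by $N$, with weights equal to the inverse of $\pi_k\theta_{qk}$. The ratio form used here is an assumption (it is what formula (3.5) gives when $x_k = 1$); the function name and the numbers are hypothetical.

```python
def expansion_estimate(y, pi, theta, N):
    """N times the weighted mean of y over the item respondents.

    y:     observed values y_qk for k in r_q
    pi:    sample inclusion probabilities pi_k
    theta: response outputs theta_qk (unit times item response probability)
    N:     known population size
    """
    weights = [1.0 / (p * t) for p, t in zip(pi, theta)]
    weighted_sum = sum(w * v for w, v in zip(weights, y))
    return N * weighted_sum / sum(weights)


# Hypothetical respondent set of three units.
t_exp = expansion_estimate(
    y=[10.0, 20.0, 30.0], pi=[0.2, 0.2, 0.2], theta=[0.5, 0.5, 1.0], N=100
)
```

Units with a low response probability receive a larger weight, which is how the estimator compensates for similar units that did not respond.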


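Comment 2. If we take $\theta_{qk} = \theta_q$ $(0 < \theta_q \leq 1)$, $\forall k \in U$, in formula (3.7), we obtain an estimator, denoted by $\hat{t}_{Naive}$, called a "naive estimator". Its expression is given by:


$$ \hat{t}_{Naive} = N \, \frac{\sum_{r_q} y_{qk}/\pi_k}{\sum_{r_q} 1/\pi_k}. \qquad (3.8) $$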


If the x, are constant, the expression (3.8) becomes iden- 
tical to formula (3.5) in which ¢ is assumed that O,, = 
O,(0 <..05 s.1) v ke Usand Niele Vik eat 


Comment 3. For the four estimators defined above, the underlying models are derived from model (3.1) and are the following: $y_{qk} = \beta_q x_k + \varepsilon_{qk}$, $E(\varepsilon_{qk}) = 0$ and $V(\varepsilon_{qk}) = \sigma_q^2 x_k$ for the first two; $y_{qk} = \beta_q + \varepsilon_{qk}$, $E(\varepsilon_{qk}) = 0$ and $V(\varepsilon_{qk}) = \sigma_q^2$, with $N$ known, for the last two. For the naive estimator, it is necessary to add the uniform unit and item response model.
4. ESTIMATORS WITH ESTIMATED RESPONSE
PROBABILITIES 


In practice, the response probabilities $\varphi_k$ and $\psi_{qk}$ as well as the probability outputs $\theta_{qk} = \varphi_k\psi_{qk}$ $(k \in U,\ q = 1, \ldots, Q)$ are actually parameters to be estimated. We estimate them by $\hat{\varphi}_k$, $\hat{\psi}_{qk}$ and $\hat{\theta}_{qk} = \hat{\varphi}_k\hat{\psi}_{qk}$ respectively. We define estimators having the same form as the prototype estimators $\hat{t}_{Exp}$, $\hat{t}_{Reg}$ and $\hat{t}_{Reg1}$ seen in Section 3, taking care to replace the unknown parameters by their respective estimates. We denote these estimators by $\hat{t}_{Expnp}$, $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$ respectively. The variance estimators are obtained from the expressions (3.3), (3.4) and (3.6), in which the unknown parameters are replaced with their estimates.


4.1 Estimation of Response Probabilities 


In theory, the probabilities $\varphi_k$ and $\psi_{qk}$ are functions of the auxiliary variables, that is, functions of the form $\varphi_k = f_1(\nu, z_k)$ and $\psi_{qk} = f_2(\mu_q, x_{qk})$ in which the quantities $\nu$ and $\mu_q$ $(q = 1, \ldots, Q)$ are unknown parameters and where the pair of vectors $(z, x_q)$, that is, $[(z_1, x_{q1}), \ldots, (z_k, x_{qk}), \ldots, (z_N, x_{qN})]'$, contains the auxiliary information available for each item $y_q$. The nonparametric estimation approach uses only the information contained in $(z, x_q)$ to estimate the $\varphi_k$ and $\psi_{qk}$. We are considering here the specific case in which $z_k = x_{qk} = x_k$, $\forall q$ $(q = 1, \ldots, Q)$ and $\forall k \in s$.

Let $x_s = \{x_k : k \in s\}$ be all the auxiliary information relating to the sample. We specify $\tau_s = \{\tau_k : k \in s\}$, a set of functions such that $\tau_k : \mathbb{R}^n \to \mathbb{R}^1$, for all $k$ in $s$. We denote by $g_k = \tau_k(x_s)$, $\forall k \in s$, the value of the $k$-th function evaluated at $x_s$. We subdivide $s$ into $n$ groups $s_k$, not necessarily disjoint, the respective sizes of which are given by:


$$ n_k = \sum_{j \in s} D(g_k - g_j), \quad (k \in s), $$

$$ D(g_k - g_j) = \begin{cases} 1 & \text{if } |g_k - g_j| \leq h_k \\ 0 & \text{otherwise,} \end{cases} $$

for a given constant $h_k$, which may depend on all the values $g_k$ $(k \in s)$. The set $s_k$, $\forall k \in s$, contains $n_k$ units, whose values $g_j$ vary little from one to another. This group is called the group whose unit $k$ is the kernel, or simply the $k$-th group. In other words, $s_k$ is a subset of $s$ for which the values of $x$ fall within the vicinity of $x = x_k$ in the sense of the Euclidean distance $d(k,j) = |\tau_k(x_s) - \tau_j(x_s)| \leq h_k = h(g_k)$, meaning that $s_k = \{j \in s : d(k,j) \leq h_k\}$. We further let $r_k = s_k \cap r$ and $r_{qk} = s_k \cap r_q$. The respective absolute frequencies of these sets are $m_k$ and $m_{qk}$, where:

$$ m_k = \sum_{j \in r} D(g_k - g_j), \quad (k \in r); $$

$$ m_{qk} = \sum_{j \in r_q} D(g_k - g_j), \quad (k \in r_q,\ q = 1, \ldots, Q). $$

Comment 4. In the general case in which nonresponse is governed by the pair of vectors $(z, x_q)$ with $z \neq x_q$, the $\tau_k$ functions would be defined in terms of $z$ in order to estimate the unit response probabilities $\varphi_k$ and in terms of $x_q$ to estimate the item response probabilities $\psi_{qk}$. Note that this kernel approach can be generalized to more than one auxiliary variable governing response. For two variables $x_1$ and $x_2$ governing nonresponse, we would specify the set $s_k = \{(j_1, j_2) : g_{j_1} \in [g_{k_1} \pm h_{k_1}] \text{ and } g_{j_2} \in [g_{k_2} \pm h_{k_2}]\}$.

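Response probabilities $\varphi_k$ and $\psi_{qk}$ are estimated respectively by the rates:


$$ \hat{\varphi}_k = \frac{m_k}{n_k} \ \ \forall k \in r; \qquad \hat{\psi}_{qk} = \frac{m_{qk}}{m_k} \ \ \forall k \in r_q, \qquad (4.1) $$


whereas the output $\theta_{qk} = \varphi_k\psi_{qk}$ is estimated by the rate:


$$ \hat{\theta}_{qk} = \hat{\varphi}_k\hat{\psi}_{qk} = m_{qk}/n_k \quad (k \in r_q,\ q = 1, \ldots, Q), \qquad (4.2) $$


which is nothing other than the response rate in the $k$-th group. This simplification of the estimated output $\hat{\theta}_{qk} = \hat{\varphi}_k\hat{\psi}_{qk}$ is, however, possible only when the two response mechanisms are governed by the same auxiliary variables.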
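
The rates in (4.1) and (4.2) can be sketched as follows for a single item and a common half-width $h$ for every unit. This is an illustration under stated assumptions, not the author's program: the function name, the boolean response indicators and the toy data are hypothetical, and the rates are computed for every sample unit although only those for respondents are used in the estimators.

```python
import numpy as np


def npe_response_probabilities(g, unit_resp, item_resp, h):
    """Window-based estimates of phi_k, psi_qk and theta_qk for one item.

    g:         values g_k for all units in the sample s
    unit_resp: True where the unit belongs to r
    item_resp: True where the unit answered the item (forced to lie in r)
    h:         common half-width of the interval [g_k - h, g_k + h]
    """
    g = np.asarray(g, dtype=float)
    unit_resp = np.asarray(unit_resp, dtype=bool)
    item_resp = np.asarray(item_resp, dtype=bool) & unit_resp
    window = np.abs(g[:, None] - g[None, :]) <= h        # D(g_k - g_j)
    n_k = window.sum(axis=1)                              # size of group s_k
    m_k = (window & unit_resp[None, :]).sum(axis=1)       # unit respondents
    m_qk = (window & item_resp[None, :]).sum(axis=1)      # item respondents
    phi_hat = m_k / n_k
    with np.errstate(divide="ignore", invalid="ignore"):
        psi_hat = np.where(m_k > 0, m_qk / m_k, np.nan)
    theta_hat = m_qk / n_k
    return phi_hat, psi_hat, theta_hat


phi_hat, psi_hat, theta_hat = npe_response_probabilities(
    g=[1.0, 2.0, 3.0, 10.0],
    unit_resp=[True, True, False, True],
    item_resp=[True, False, False, True],
    h=1.0,
)
```

Since unit $k$ always falls in its own window, $n_k \geq 1$ and the rates in (4.2) are always defined.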

Two approaches are considered here: the one based on the values of the variable $x$ (npx) and the one based on the ranks of the values of the variable $x$ (npr). The NPE(npx), proposed by Giommi (1987), is obtained by taking $g_k = \tau_k(x_s) = x_k$ $(k \in s)$. To offset the possible effect of excessively large and excessively small values of $x_k$, we introduce a variant that consists in using the ranks of the $x_k$, that is, NPE(npr). We consider the function $u$ such that $u(t) = 1$ if $t \geq 0$ and $u(t) = 0$ if $t < 0$. For any unit $k$ in $s$, let $u_k = \sum_{j \in s} u(x_k - x_j)$ = the number of components of $x_s$ that are less than or equal to $x_k$ = the rank of $x_k$ in $s$. The NPE(npr) is then equivalent to letting $g_k = \tau_k(x_s) = u_k$ $(k \in s)$.
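
The rank transformation used by the npr variant counts, for each unit, how many sample values are less than or equal to its own. A minimal sketch follows; the function name and the data are hypothetical.

```python
import numpy as np


def rank_transform(x):
    """u_k = number of sample values x_j with x_j <= x_k."""
    x = np.asarray(x, dtype=float)
    # Entry [k, j] is True when x_j <= x_k, i.e. u(x_k - x_j) = 1.
    return (x[None, :] <= x[:, None]).sum(axis=1)


u = rank_transform([5.0, 1.0, 100.0, 5.0])
```

The outlying value 100 is mapped to rank 4, only one step from its neighbours, which is what damps the effect of extreme values of $x$. Tied values share the same rank under this definition.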


4.2 Selection of Interval Limits 


The main problem in the NPE approach is the optimum choice of the $h_k$ constants that determine the limits of the intervals $[g_k - h_k;\ g_k + h_k]$, $\forall k \in s$, that is, a choice of $h_k = h_k(g_k)$ that reduces the bias and mean square error of any estimator using the estimated outputs $\hat{\theta}_{qk}$ specified in formula (4.2).

According to Giommi (1985, 1987), the terms $n_k$, $m_k$ and $m_{qk}$ that are used to estimate the response probabilities are, apart from the standardization factors, kernel estimators of the density function according to the approach of Rosenblatt (1956) for the various series of values of $g$. As an example, it is easy to demonstrate that:


$$ n_k = \sum_{j \in s} D(g_k - g_j) = 2nh(n)\hat{f}(x_k), $$


where $h(n) = h(g_k,\ k \in s)$ is a positive constant that converges toward zero at an appropriate rate. The theoretical optimum constant, according to the least mean square error criterion, is given by $h(n) = K_f n^{-1/5}$, where $K_f$, as defined by Rosenblatt (1956) and Wegman (1972a and b), is obtained by the expression $K_f = [9f(x)/2\,|f''(x)|^2]^{1/5}$.

In practice, $h(n)$ can be obtained only by simulation, since it depends on the density function to be estimated. Giommi (1985) used $h(n) = 2EI_s n^{-1/5}$, where $EI_s$ is the interquartile range in the sample. Kraft, Lepage and van Eeden (1983) chose $h(n) = C(n)EI_s$, where $C(n) = (K_f/EI_s)n^{-1/5}$. As our choice, we shall adopt $h(n) = C(n)S_{gs}$, where $C(n) = (K_f/S_{gs})n^{-1/5}$ and where $S_{gs}$ is the corrected standard deviation of the values $g_k$ $(k \in s)$. Basing ourselves on the study of Kraft, Lepage and van Eeden (1983), we will empirically determine a value $C_{opt}$ of $C$ that is optimal according to the criterion of least bias and least mean square error of the estimator $\hat{t}_{Expnp}$, and compare the two versions of the NPE approach.
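
The adopted rule amounts to taking the half-width as a multiple of the corrected sample standard deviation of the $g_k$. A minimal sketch follows, assuming that "corrected" means the $n - 1$ divisor; the function name and the default value of `C` are hypothetical choices for illustration.

```python
import numpy as np


def window_half_width(g, C=0.5):
    """Half-width h = C * S, with S the corrected standard deviation of g."""
    g = np.asarray(g, dtype=float)
    return C * np.std(g, ddof=1)   # ddof=1 gives the n - 1 divisor


h = window_half_width([1.0, 2.0, 3.0, 4.0, 5.0], C=0.5)
```

Because the rule is scale-based, the same value of `C` can be applied to the raw values (npx) and to the ranks (npr), although the resulting windows differ.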


4.3. Expansion and Regression Estimators 


Calculation of the approximate bias and variance of the estimators $\hat{t}_{Exp}$, $\hat{t}_{Reg}$ and $\hat{t}_{Reg1}$ is simplified by the fact that the probabilities $\varphi_k$ and $\psi_{qk}$ are assumed to be known. For the estimators $\hat{t}_{Expnp}$, $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$, these probabilities are estimated by $\hat{\varphi}_k$ and $\hat{\psi}_{qk}$. These probability estimators do not correspond to any probability model that would enable us to calculate the bias and the variance conditional on this model. In other words, the sets $r_q$ are generated by unknown response mechanisms for which we estimate the response probabilities by an approach that does not allow for inference conditional on any model underlying the estimation of probabilities.

We would be tempted to resort to a Taylor series expansion of the function $1/\hat{\theta}_{qk}$ to justify the approximation of $1/\hat{\theta}_{qk}$ by $1/\theta_{qk}$. In this case, the bias and the variance of $\hat{t}_{Expnp}$, $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$ would be approximated by the approximate bias and variance of $\hat{t}_{Exp}$, $\hat{t}_{Reg}$ and $\hat{t}_{Reg1}$. However, for sample sizes that are not sufficiently large, we are in danger of having $\hat{\theta}_{qk} \neq \theta_{qk}$ for the majority of the $k \in r_q$, and consequently:


$$ V(\hat{t}_{Expnp}) \neq V(\hat{t}_{Exp}), \quad V(\hat{t}_{Regnp}) \neq V(\hat{t}_{Reg}), \quad \text{and} \quad V(\hat{t}_{Reg1np}) \neq V(\hat{t}_{Reg1}). $$


However, to construct confidence intervals based on $\hat{t}_{Expnp}$, $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$, it is necessary to define estimators for their respective variances. Not having explicit expressions for these variances, it is difficult to define variance estimators and study their properties analytically. The choice of a given estimator is quite difficult to justify. The most natural way of obtaining variance estimators for the variances of $\hat{t}_{Expnp}$, $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$ is to do a simple substitution of $\theta_{qk}$ $(= \varphi_k\psi_{qk})$ by $\hat{\theta}_{qk}$ $(= \hat{\varphi}_k\hat{\psi}_{qk})$, $\forall k \in r_q$, and of $\theta_{qk\ell}$ by $\hat{\theta}_{qk\ell}$, $\forall k \neq \ell \in r_q$ $(\hat{\theta}_{qk\ell} = \hat{\varphi}_{k\ell}\hat{\psi}_{qk\ell})$, in all the formulas specified for the respective variance estimators of the estimators $\hat{t}_{Exp}$, $\hat{t}_{Reg}$ and $\hat{t}_{Reg1}$.


5. MONTE CARLO STUDY: COMPARISON 
OF ESTIMATORS 


For simulation purposes, we assume that Bernoulli trials govern each of the response mechanisms (total or partial) and that simple random sampling without replacement is the sample design used. We consider a vector $(y_1, y_2, y_3)'$ of three items $(Q = 3)$ and a variable $x$ containing the auxiliary information. We first generate the $x_k$ $(k \in U)$ by a gamma distribution with parameters $a_1$ and $a_2$. The generation of items $y_1$, $y_2$, $y_3$ is based on the linear model (3.1) and the gamma distribution. More specifically, we generate the $y_{qk}$ $(k \in U$ and $q = 1, 2, 3)$ according to a gamma distribution with parameters $a_{1q}(x_k)$ and $a_{2q}(x_k)$ defined by:


$$ a_{1q}(x_k) = \frac{\beta_q^2 x_k}{\sigma_q^2}, \qquad a_{2q}(x_k) = \frac{\sigma_q^2}{\beta_q}, \qquad \sigma_q^2 = \beta_q^2 a_2 \left[ \frac{1}{\rho_{xy_q}^2} - 1 \right], \quad q = 1, 2, 3. $$


The choice of the gamma distribution is based on its general form, which gives rise to a great variety of distributions, and on the fact that it can represent the distribution of various types of populations (Johnson and Kotz 1970, p. 172). We establish a priori the parameters $a_1$, $a_2$, $\beta_q$ and $\rho_{xy_q}$ $(q = 1, 2, 3)$, namely:


$$ a_1 = 2, \quad a_2 = 10, \quad (\beta_1\ \beta_2\ \beta_3)' = (0.75\ \ 0.65\ \ 0.60)', $$

$$ (\rho_{xy_1}\ \rho_{xy_2}\ \rho_{xy_3})' = (0.90\ \ 0.85\ \ 0.70)'. $$


To generate the unit and item response probabilities, we consider the following exponential forms:


$$ \varphi_k = \exp\{-(\lambda_1 x_k + \lambda_2 v_k)\} \quad \text{and} \quad \psi_{qk} = \exp\{-(\lambda_{1q} x_k + \lambda_{2q} v_{qk})\}, $$


where the $v_k$ and the $v_{qk}$ result from a uniform distribution on $(0; 1)$. The constants $\lambda_1$, $\lambda_2$, $\lambda_{1q}$ and $\lambda_{2q}$ are such that: $\lambda_1 = 0.15/\bar{x}_U$, $\lambda_{1q} = 0.15/\beta_q\bar{x}_U$ and $\lambda_2 = \lambda_{2q} = 0.45$ $(q = 1, 2, 3)$. Such a parameterization makes it possible to have an average response rate (total or partial) of approximately 70%. We could have varied these constants or used other continuous functions.
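
The response mechanisms described above can be sketched as follows. Several points are assumptions made for illustration rather than details confirmed by the text: the gamma parameters are read as shape $a_1$ and scale $a_2$, $\lambda_{1q}$ is read as $0.15/(\beta_q\bar{x}_U)$, and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed
N = 1000
a1, a2 = 2.0, 10.0                   # assumed: shape and scale of the gamma
beta = np.array([0.75, 0.65, 0.60])

x = rng.gamma(shape=a1, scale=a2, size=N)
x_bar = x.mean()

lam1 = 0.15 / x_bar
lam1q = 0.15 / (beta * x_bar)        # assumed reading of the constants
lam2 = lam2q = 0.45

v = rng.uniform(size=N)              # v_k
vq = rng.uniform(size=(N, 3))        # v_qk

# Unit and item response probabilities (exponential forms).
phi = np.exp(-(lam1 * x + lam2 * v))
psi = np.exp(-(lam1q[None, :] * x[:, None] + lam2q * vq))

# Bernoulli trials: unit response, then item response among participants.
unit_resp = rng.uniform(size=N) < phi
item_resp = (rng.uniform(size=(N, 3)) < psi) & unit_resp[:, None]

mean_unit_rate = phi.mean()
mean_item_rates = psi.mean(axis=0)
```

Under these assumptions the mean of `phi` comes out close to the 70% average response rate quoted in the text.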
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Figure 5.2 Absolute bias and MSE: the estimator fExpnpy for 
imo) 


5.1 Comparison of the Two Variants of the NPE Approach 


We consider a population of size $N = 100$ and draw a sample of size $n = 60$, which we subject to the response mechanisms. We repeat the sampling $K$ times and calculate the bias $\mathbb{B}(\hat{t}_{Expnp})$ and the mean square error $MSE(\hat{t}_{Expnp})$ for different values of $C$ $(C \geq 0)$. Next we repeat this experiment with $N = 1{,}000$ and $n = 200$.
The results of this empirical study are illustrated by the diagrams of $\mathbb{B}(\hat{t}_{Expnp})$ and $MSE(\hat{t}_{Expnp})$ as a function of the constant $C$. From this brief study we observe, firstly, that the value $C_{opt}$ of the optimal constant $C$ is in the interval $[0; 1]$, depends on the size of the sample and decreases as the sample size increases (Figures 5.1 and 5.2).

We also observe that the estimator $\hat{t}_{Expnpr}$ is still better, in terms of smaller bias and mean square error, than the estimator $\hat{t}_{Expnpx}$ in the interval $[0; 1]$, as illustrated in Figure 5.3 for item 3, the item least correlated with the auxiliary variable. A very important fact to be noted is that for the estimator $\hat{t}_{Expnpr}$ we more quickly reach the values of the bias and the mean square error of the estimator $\hat{t}_{Naive}$, in $[0; 1]$ at $C = 0.05$ and outside this interval at $C = 4$. Unlike with the estimator $\hat{t}_{Expnpr}$, the values of the bias and the mean square error of the estimator $\hat{t}_{Expnpx}$ first reach maximum values at $C = 0.05$ before taking on the values of the bias and mean square error of $\hat{t}_{Naive}$ at $C = 0$. We also note that for a fairly large size $n$ and for any value of $C$ in the interval $[0; 1]$, the variation is hardly perceptible (Figure 5.3). For this reason, we suggest that a compromise value be used: $C = 0.5$ (that is, $h = 0.5S_{gs}$).


Figure 5.3 Absolute bias and MSE as a function of the constant $C$: the estimators $\hat{t}_{Expnpx}$ and $\hat{t}_{Expnpr}$ for item 3


5.2 Overall Comparison of Estimators 


The complete simulation consists of
(i) first, drawing the sample s of size n = 200 from the
population of size N = 1,000, (ii) then applying the unit and
item response mechanisms to obtain the sets $r_q$ (q = 1, 2, 3),
and (iii) lastly, calculating, for each estimator, the values
of $\hat{t}$ and $\hat{V}(\hat{t})$. We repeat this operation K times. Once the
experiment is completed, we calculate, as performance
measures, (i) the bias $B(\hat{t}) = E(\hat{t}) - t_y$, (ii) the
mean square error $MSE(\hat{t}) = E(\hat{t} - t_y)^2$, (iii) the
expectation of the variance estimator $E(\hat{V}(\hat{t}))$ and (iv)
the recovery (coverage) rate
$P_\alpha(\hat{t}) = P\{|\hat{t} - t_y| \le z_{\alpha/2}[\hat{V}(\hat{t})]^{1/2}\}$.
We can also calculate, for each given
estimator, (v) the relative error $RE(\hat{t})$ $[= B(\hat{t})/t_y]$, (vi) the
variance $V(\hat{t})$ $[= MSE(\hat{t}) - (B(\hat{t}))^2]$, (vii) the relative
bias $RB(\hat{t})$ $[= |B(\hat{t})|/(V(\hat{t}))^{1/2}]$ as well as (viii) the
relative error of the variance estimator $RE(\hat{V}(\hat{t}))$
$[= B(\hat{V}(\hat{t}))/V(\hat{t})]$ in order to examine the sensitivity
of the variance estimators to nonresponse.
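
The performance measures listed above can be computed from the K replicates roughly as follows. This is a minimal sketch rather than the authors' program: the function name, the dictionary keys and the default critical value z = 1.96 are assumptions made for illustration, and the variance across replicates is assumed to be strictly positive.

```python
import math


def performance_measures(estimates, var_estimates, true_total, z=1.96):
    """Summarise K simulated point estimates and their variance estimates.

    All names here are illustrative; z is the normal critical value
    used for the nominal confidence level (1.96 for 95%).
    """
    K = len(estimates)
    bias = sum(estimates) / K - true_total
    mse = sum((t - true_total) ** 2 for t in estimates) / K
    variance = mse - bias ** 2            # V = MSE - B^2
    mean_var_est = sum(var_estimates) / K
    # Share of replicates whose interval covers the true total.
    covered = sum(
        1 for t, v in zip(estimates, var_estimates)
        if abs(t - true_total) <= z * math.sqrt(v)
    )
    return {
        "bias": bias,
        "mse": mse,
        "variance": variance,
        "mean_var_est": mean_var_est,
        "coverage": covered / K,
        "rel_error": bias / true_total,
        "rel_bias": abs(bias) / math.sqrt(variance),
        "rel_error_var": (mean_var_est - variance) / variance,
    }
```

In an actual run the two input lists would hold the K values of the point estimate and of its variance estimate for one estimator and one item.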


5.3 Interpretation of the Results of the Global Simulation 


I. The Prototype Estimators 


The simulation results confirm the theory. For these 
estimators, we make the following observations, based on 
Tables 5.1 to 5.4: 


(i) $\hat{t}_{Exp}$, $\hat{t}_{Reg}$ and $\hat{t}_{Reg1}$ are approximately unbiased;

(ii) $MSE(\hat{t}_{Reg1}) < MSE(\hat{t}_{Reg}) < MSE(\hat{t}_{Exp})$;

(iii) $V(\hat{t}_{Reg1}) < V(\hat{t}_{Reg}) < V(\hat{t}_{Exp})$ and
$E[\hat{V}(\hat{t}_{Reg1})] < E[\hat{V}(\hat{t}_{Reg})] < E[\hat{V}(\hat{t}_{Exp})]$.


For these estimators, we also expected that:


(i) $E\hat{V}(\hat{t}_{Exp}) \approx V(\hat{t}_{Exp})$, $E\hat{V}(\hat{t}_{Reg}) \approx V(\hat{t}_{Reg})$ and
$E\hat{V}(\hat{t}_{Reg1}) \approx V(\hat{t}_{Reg1})$;


(ii) The relative bias is negligible [$RB(\hat{t}) < 0.10$] and the recovery
rates are close to the theoretical rates. The relative
errors $RE(\hat{t})$ and $RE(\hat{V}(\hat{t}))$ are negligible, and are
in part due to the simulation (errors due to the limited
number of repetitions of the experiment).


Table 5.1 
The Values of $B(\hat{t})$ and $MSE(\hat{t})$

                            y1                      y2                      y3
Estimator               B        MSE            B        MSE            B        MSE
$\hat{t}_{Exp}$      -0.036     1.690        -0.052     1.525        -0.056     2.299
$\hat{t}_{Reg}$      -0.020     0.735        -0.019     0.744        -0.030     1.446
$\hat{t}_{Reg1}$     -0.012     0.319        -0.012     0.431        -0.021     1.202
$\hat{t}_{Naive}$    -2.037     5.069        -1.937     4.535     [illegible] [illegible]
$\hat{t}_{Expnpx}$   -0.690     1.345        -0.771     1.407        -1.228     2.604
$\hat{t}_{Expnpr}$ [illegible] [illegible]   -0.709     1.249        -1.140     2.345
$\hat{t}_{Regnpr}$   -0.293     0.785        -0.414     0.830        -0.895     1.834
$\hat{t}_{Reg1npr}$  -0.285     0.376        -0.407     0.520        -0.886     1.621


Table 5.2 


The Values of $V(\hat{t})$, $E[\hat{V}(\hat{t})]$ and the Percentage Contribution (%) of the Sampling Variance Component to $E[\hat{V}(\hat{t})]$

                              y1                        y2                        y3
Estimator               V     E[V^]    %          V     E[V^]    %          V     E[V^]    %
$\hat{t}_{Exp}$       1.689   1.683   29.8      1.525   1.485   29.1      2.296   2.235   26.9
$\hat{t}_{Reg}$       0.734   0.697   72.2      0.744   0.702   61.5      1.445   1.391   42.6
$\hat{t}_{Reg1}$      0.319   0.293   34.0      0.431   0.402   32.7      1.201   1.130   29.3
$\hat{t}_{Naive}$     0.918   0.911   43.3      0.784   0.766   43.5      0.983   0.958   44.2
$\hat{t}_{Expnpx}$    0.869   1.403   32.0      0.804   1.173   32.3      1.097   1.322   35.4
$\hat{t}_{Expnpr}$    0.814   1.291   35.1      0.746   1.089   35.2      1.046   1.285   37.1
$\hat{t}_{Regnpr}$    0.700   0.627   73.9      0.658   0.588   66.6      1.033   0.955   50.5
$\hat{t}_{Reg1npr}$   0.294   0.259   36.7      0.355   0.315   37.6      0.836   0.751   37.1


Table 5.3 
The Values of $RE(\hat{t})$ and $RE(\hat{V}(\hat{t}))$

                               y1                     y2                     y3
Estimator              RE(t^)    RE(V^)       RE(t^)    RE(V^)       RE(t^)    RE(V^)
$\hat{t}_{Exp}$       -0.0024   -0.0015      -0.0040   -0.0242      -0.0045   -0.0267
$\hat{t}_{Reg}$       -0.0014   -0.0510      -0.0015   -0.0556      -0.0024   -0.0373
$\hat{t}_{Reg1}$      -0.0008   -0.0812      -0.0009   -0.0684      -0.0017   -0.0596
$\hat{t}_{Naive}$     -0.1377   -0.0083      -0.1474   -0.0230      -0.1787   -0.0260
$\hat{t}_{Expnpx}$    -0.0466    0.6141      -0.0591    0.4582      -0.0988    0.2046
$\hat{t}_{Expnpr}$    -0.0406    0.5860      -0.0540    0.4591      -0.0917    0.2282
$\hat{t}_{Regnpr}$    -0.0198   -0.1038      -0.0315   -0.1077      -0.0720   -0.0752
$\hat{t}_{Reg1npr}$   -0.0193   -0.1191      -0.0310   -0.1124      -0.0713   -0.1015


Table 5.4 
The Levels $P_\alpha(\hat{t})$ at 90% and 95%, and $RB(\hat{t})$

                              y1                        y2                        y3
Estimator              90%     95%     RB        90%     95%     RB        90%     95%     RB
$\hat{t}_{Exp}$       0.873   0.922   0.027     0.870   0.914   0.042     0.852   0.904   0.037
$\hat{t}_{Reg}$       0.881   0.929   0.024     0.876   0.929   0.022     0.870   0.917   0.025
$\hat{t}_{Reg1}$      0.866   0.926   0.021     0.873   0.923   0.018     0.860   0.914   0.019
$\hat{t}_{Naive}$     0.322   0.427   2.126     0.298   0.405   2.187     0.287   0.389   2.239
$\hat{t}_{Expnpx}$    0.851   0.906   0.740     0.800   0.874   0.866     0.667   0.758   1.172
$\hat{t}_{Expnpr}$    0.872   0.925   0.666     0.830   0.893   0.820     0.700   0.789   1.114
$\hat{t}_{Regnpr}$    0.839   0.908   0.350     0.806   0.878   0.510     0.712   0.789   0.880
$\hat{t}_{Reg1npr}$   0.804   0.871   0.526     0.767   0.844   0.683     0.678   0.763   0.969


II. The Naive Estimator


The naive estimator registers absolute values of $B(\hat{t})$
and $RE(\hat{t})$ that are very high in relation to the other
estimators (Tables 5.1 and 5.3). The same is true for the
values of $MSE(\hat{t})$ (Table 5.1). The values of the observed
recovery rates $P_\alpha(\hat{t})$ as well as those of the relative bias
$RB(\hat{t})$ are hardly surprising, considering the size of the
point estimate bias (Table 5.4).

The behaviour of $\hat{t}_{Naive}$ in terms of variance and variance
estimator (Table 5.2) is due to the fact that it constitutes
a particular case of $\hat{t}_{Exp}$, assuming uniform response
mechanisms. In a sense, this amounts to assuming that the
data are missing randomly.


III. The Adjusted Estimators


The reduction of the bias and the mean square error
resulting from the use of the adjusted estimators (Table 5.1)
is quite significant in comparison with the naive estimator,
especially for the regression estimators (the estimators
$\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$). In terms of variance (Table 5.2), we
have the following inequalities:


$V(\hat{t}_{Reg1npr}) < V(\hat{t}_{Regnpr}) < V(\hat{t}_{Expnpr}) < V(\hat{t}_{Expnpx})$,


which are analytically difficult to demonstrate. Little
variation [in terms of $V(\hat{t})$ and $E(\hat{V}(\hat{t}))$] is observed
between items $y_1$ and $y_2$, in light of the small difference
between their correlations (0.05). On the other hand, the
effect of the correlation with the auxiliary variable on $V(\hat{t})$
and on $E(\hat{V}(\hat{t}))$ may be observed by comparing items $y_1$
and $y_3$, then $y_2$ and $y_3$: the differences between the
correlations are greater in these two cases (0.20 and 0.15
respectively).

In terms of variance estimators (Table 5.2), we observe
that:


$\hat{V}(\hat{t}_{Reg1np}) < \hat{V}(\hat{t}_{Regnp}) < \hat{V}(\hat{t}_{Expnp})$,


as is the case for the estimators $\hat{t}_{Reg1}$, $\hat{t}_{Reg}$ and $\hat{t}_{Exp}$.
What is surprising, and is of course due to the effect of
the auxiliary variables on the variance components relative
to the response mechanisms, is the fact that the estimators
$\hat{t}_{Expnp}$ overestimate the variance with very large absolute
values of $RE(\hat{V}(\hat{t}))$, while the regression estimators
$\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$ underestimate the variance with absolute
values of $RE(\hat{V}(\hat{t}))$ that are smaller than those
of $\hat{t}_{Expnp}$ (Table 5.3). For the estimators $\hat{t}_{Expnp}$, not only
is the total variance high in relation to that of the regression
estimators, but also the relative contribution of the
sampling variance is low (Table 5.2).

In terms of recovery rate (Table 5.4), the estimators
$\hat{t}_{Expnp}$ yield observed rates that are closer to the theoretical
rates than the estimators $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$. However,
the values of the relative bias $RB(\hat{t})$ are higher for $\hat{t}_{Expnp}$
than for $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$, which makes the confidence
intervals less reliable.


IN CONCLUSION 
(i) If the goal of the estimation is to reduce bias and
mean square error, all the estimators adjusted for nonresponse
perform well relative to assuming a uniform response
mechanism (which basically amounts to doing nothing
about nonresponse). The rate of reduction of the bias of
each estimator in relation to the naive estimator is at least
66%. The regression estimators $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$ are the
most promising of the various estimators considered
(Table 5.1).


(ii) If the goal is to construct confidence intervals, we
need a pair of estimators $[\hat{t}, \hat{V}(\hat{t})]$ that simultaneously
minimize the absolute biases $|B(\hat{t})|$ and $|B(\hat{V}(\hat{t}))|$.
Tables 5.1 and 5.2 clearly show that the estimators $\hat{t}_{Regnp}$
and $\hat{t}_{Reg1np}$ are the best. These estimators are less sensitive
to nonresponse if we consider the values of $RE(\hat{t})$ and
$RE(\hat{V}(\hat{t}))$ (Table 5.3). Nevertheless, the criterion of
reliability of the confidence intervals ($RB(\hat{t}) < 0.10$) is never
met (Table 5.4).


(iii) The behaviour of the adjusted estimators (a) for
item $y_1$, which is the item most highly correlated with
the auxiliary variable, compared to item $y_3$, then (b) for
item $y_2$ compared to item $y_3$ ($y_3$ being the item that is
least correlated with the auxiliary variable), shows that
with very strong explanatory variables (for the items of
interest and for the response mechanisms), better results can
be achieved not only in terms of
less bias $|B(\hat{t})|$ and $|B(\hat{V}(\hat{t}))|$ but also in terms of
less mean square error (a gain in precision in relation to
the naive estimator) and a better recovery rate for the
confidence intervals (Tables 5.1 to 5.4).


(iv) The behaviour of the estimators $\hat{t}_{Regnp}$ and $\hat{t}_{Reg1np}$,
in terms of bias, variance and variance estimation, is
consistent with the studies conducted by Särndal and Hui
(1981), Särndal and Swenson (1985, 1987), Bethlehem
(1988) and Kott (1987) on the usefulness of regression
estimators in nonresponse situations and the importance
of having good predictor variables for the items of interest
and the response mechanisms.
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Competitors to Genuine zps Sample Designs: 
A Comparison 


OLIVER SCHABENBERGER and TIMOTHY G. GREGOIRE! 


ABSTRACT 


Without-replacement list sampling with probability proportional to some measure of element size has not enjoyed
much application in forestry because of the difficulty of implementing such sample strategies, which have been termed
πps designs to distinguish without-replacement sampling from the well-known with-replacement pps designs. In
this contribution, an exact πps strategy (Sunter's variant 2), an approximate πps design (Sunter's variant 1) and
the Rao-Hartley-Cochran random group method are examined, and the variances of the respective estimators for
total bole volume are computed for four tree populations. The results indicate that, compared to the Rao-Hartley-Cochran
design, Sunter's variant 1 in general leads to higher precision if the relationship between auxiliary information
$x_k$ and target characteristic $y_k$ is loose, but is sensitive to the ordering of the sampling frame, whereas the
Rao-Hartley-Cochran design does not require the sampling frame to be ordered at all and appears to be superior
if strong linear relationships between $x_k$ and $y_k$ are present.


KEY WORDS: Probability proportional to size sampling; Fixed sample size; Approximate πps designs; Empirical
comparison.


1. INTRODUCTION 


Rao (1978) classifies methods for unequal probability
sampling without replacement into two broad categories:
(i) sampling schemes where the inclusion probabilities
$\pi_k$ are proportional to the characteristic of interest, $y_k$,
and the Horvitz-Thompson π estimator $\hat{t}_\pi$ is utilized;
(ii) schemes that entertain statistics other than the Horvitz-Thompson
estimator. Strategies in (i) are termed IPPS
(inclusion probability proportional to size) designs and members
of (ii) non-IPPS designs. In recent literature, e.g., Särndal
et al. (1992), selection probabilities when sampling with
replacement are denoted $p_k$, whereas their counterparts
when sampling without replacement are denoted $\pi_k$. We
therefore call sampling designs in (i) genuine πps strategies
in this paper. IPPS and non-IPPS designs have in
common that under exact proportionality, i.e., $\pi_k \propto y_k$
and n(s) = n (constant), it is implied that $Var(\hat{t}) = 0$,
where $\hat{t}$ is the respective estimator used. For this reason,
it seems appealing to draw a sample without replacement
where $\pi_k \propto y_k$ and to keep the sample size fixed at the
same time. Our interest in these methods concerns their
utility to sampling needs in forestry.


Several exact πps designs are available; Rao (1978)
gives an in-depth account and discussion. Their implementation,
however, is often a non-trivial task and numerically
cumbersome for sample sizes usually encountered in
forestry practice. Many of these exact πps strategies
require enumeration of all possible samples or use algorithms
that become increasingly prohibitive as n increases.
A simple design, which is feasible for n < 10, is described
by Sampford (1967).

In forestry, however, the number of samples to be
drawn at any stage of a survey is oftentimes much larger,
even after stratification. Consequently, one either approximates
the πps selection process in a manner that allows
the inclusion probabilities to be computed exactly, or
approximates second-order inclusion probabilities $\pi_{kl}$ in
a design that ensures an exact πps selection. Rao, Hartley
and Cochran (1962) described a non-IPPS design, also
known as the random group method, that has gained considerable
attention (see also Rao 1966, 1978). It is not a
πps design, since it utilizes an estimator other than $\hat{t}_\pi$ to
ensure zero variance when the $\pi_k$ are proportional to $y_k$,
but is of remarkable simplicity. An approximate πps
design of the first kind is Sunter's method (Sunter 1977a,
1977b). These two designs are referred to in what follows
as RHC and SUN1. Sunter (1986, 1989) described an exact
πps strategy that can be applied if certain stipulated
conditions about the ordering of the sampling frame are
met and the possible samples can be enumerated to obtain
$\pi_{kl}$ for some pairs of elements. To avoid enumeration we
use an approximation to these $\pi_{kl}$. This scheme will be
called variant 2 or SUN2 in what follows.


Särndal et al. (1992) describe the SUN1 and RHC
strategies as entailing some loss of efficiency compared to
corresponding πps designs, but no assessment of their
comparative efficiency is provided. To our knowledge, none
is extant; yet in light of the practical advantages offered
by these designs, a comparative assessment would be helpful.


! Oliver Schabenberger and Timothy G. Gregoire, Department of Forestry, Section Forest Biometrics, College of Forestry and Wildlife Resources, 
Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0324, U.S.A. 
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The purpose of this study is to compare the perfor- 
mance of the three strategies empirically, using data from 
forestry field studies and sampling intensities up to 10% 
which involve reasonably large samples. 

The designs SUN1, SUN2, and RHC are appropriate
if one has access to a list of population elements from
which the sample can be drawn. A complete enumeration
of the target characteristic $y_k$ is not anticipated, but the
probabilities of inclusion may be made proportional to an
auxiliary variable $x_k$. That is, having complete knowledge
about $x_k$ prior to sampling, where it is surmised that $x_k$ is
roughly proportional to $y_k$, we try to achieve $\pi_k \propto x_k$
while n = constant.

In forestry such auxiliary information oftentimes is an 
easily obtainable characteristic of tree size such as height 
h, diameter at breast height d, or a combination thereof, 
which can be used to sample efficiently for bole volume 
or biomass, y. For example, the geometry of tree stems 
suggests relationships between d, h, and the volume 
contained in the tree bole that can be exploited in sampling. 
In the present investigation, the target parameter is the 
total bole volume per unit area or in an entire forest stand. 
In practice, some form of multistage sampling would be 
used, but for sake of exposition the present comparison 
includes single stage sampling only. 

For the RHC and SUN designs, the auxiliary variables
$d$, $d^2$, $d^2h$ and the tree sequence number were used. The
sequence number was chosen as an auxiliary variable since
in the absence of ordering by size it is clearly unrelated to
the target characteristic. It should indicate the sensitivity
of competing strategies to uninformative auxiliary information
(cf. Rao 1966).

All designs were investigated with samples of intensity
1%, 2%, 5%, and 10%. The performance of the different
sampling designs was gauged in terms of the variance of
each estimator of $t = \sum_{k=1}^{N} y_k$. Ratio-of-means estimation
following simple random sampling was used as a benchmark,
since it utilizes the same auxiliary information. The
variances of the sample designs described in the following
section were compared to the mean square error of the
ratio-of-means estimator (ROM), evaluated using the
second-order delta method approximation in Sukhatme
et al. (1984).


2. SAMPLE DESIGNS 


2.1 Sunter’s Design, Variant 1 


Sunter initially proposed two different approximate πps
designs: one relaxes the requirement of proportionality of
inclusion probabilities $\pi_k$ for a subset of the population,
the other allows for some variation in sample size (Sunter
1977a, 1977b; Schreuder et al. 1990). In order that precision
not be unduly sacrificed, it is assumed in the latter
case that the variance of n(s) is small, and in the first
case that altering some $\pi_k$ is not too serious. In this study
only the first method was used, since the RHC design
also operates with fixed sample size, and it is the comparative
feasibility of the Sunter and RHC designs that
prompted this study. Särndal et al. (1992) describe the
allocation of the sample and the computation of the inclusion
probabilities in detail. For part of the population,
$\pi_k \propto x_k$, where $x_k$ is the auxiliary information available
for the k-th subject (or record). Let k* denote an element
in the ordered population. Then for all elements where
k < k*, selection is carried out proportional to $x_k$. The
process ends if a total sample of size n is allocated or if
$k = k^* = \min\{\min\{k: n x_k / t_k \ge 1\}, N - n + 1\}$,
where $t_k = \sum_{j=k}^{N} x_j$. In the latter case, the remaining
samples are selected according to the list-sequential scheme
of Bebbington (1975) among those elements for which
k ≥ k*. As Sunter points out, this sampling scheme has the
advantage that only one pass through the sampling frame
is necessary. Moreover, the first and second order inclusion
probabilities can be computed during this pass through
the file. Since the design ensures that $\pi_{kl} > 0$ for all k, l,
that $\pi_k \pi_l - \pi_{kl} > 0$ for all k ≠ l, and that n is fixed, the non-negative
Yates-Grundy estimator of variance can be readily computed.
The first order inclusion probabilities are obtained
as $\pi_k = n x_k / T_N$ if $k < k^*$ and $\pi_k = n \bar{x}_{k^*} / T_N$ if $k \ge k^*$,
wherein $T_N = \sum_{k=1}^{N} x_k$ and $\bar{x}_{k^*} = t_{k^*} / (N - k^* + 1)$.
Expressions for the second order inclusion probabilities
are given in Särndal et al. (1992).

Consequently, the ordering of the population affects
the performance of the SUN1 design, since the inclusion
probabilities and therefore the variance depend on k*
(see (2) below). For large sample sizes the condition
$k^* = \min\{\min\{k: n x_k / t_k \ge 1\}, N - n + 1\}$ may be
resolved in favor of $k^* = \min\{k: n x_k / t_k \ge 1\}$, which in
turn may lead to a premature switch from πps to SRS
sampling owing to the ordering of the sampling frame.
Note that $x_k / t_k < x_{k'} / t_{k'}$ for k' > k need not be true,
since if $x_k > x_{k+1}$ and $t_k > t_{k+1}$ it may well be that $x_k / t_k$
is greater or smaller than $x_{k+1} / t_{k+1}$. It thus can happen
that $n x_k \ge t_k$ and $n x_{k'} < t_{k'}$ for some k, k' where
k' > k. In this case, which may occur rather frequently, it
is unclear whether the switch from πps to SRS should take place
the first time $n x_k \ge t_k$ or not. Sometimes it may happen
that for the first two or three elements of the population
$n x_k \ge t_k$, but that $n x_k$ falls below $t_k$ for the main portion of the
sampling frame. This is especially the case when n is large
and a few very big $x_k$ appear at the top of the population list.
To stick to Sunter's rule in such a case would in essence
be equivalent to drawing a simple random sample.
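
The switch point k* and the first-order inclusion probabilities of the SUN1 design can be sketched as follows. This is an illustrative sketch only, assuming the switch rule takes the first k with n x_k ≥ t_k (capped at N − n + 1) and that, from k* onward, each element receives the mean of the remaining size measures, as in the description of Sunter's method in Särndal et al. (1992); the function name is hypothetical and the frame is assumed to be already ordered with positive x values.

```python
def sunter_first_order(x, n):
    """Return (k_star, pi) for Sunter's variant 1 on an ordered frame x.

    k_star is 1-based. Illustrative sketch: pi_k = n*x_k/T_N for k < k*,
    and pi_k = n*xbar/T_N for k >= k*, where xbar is the mean of x over
    the elements k*, ..., N.
    """
    N = len(x)
    total = sum(x)                      # T_N
    tails = [0.0] * N                   # tails[k-1] = t_k = x_k + ... + x_N
    running = 0.0
    for idx in range(N - 1, -1, -1):
        running += x[idx]
        tails[idx] = running
    # First k at which n*x_k/t_k >= 1 (always exists, since t_N = x_N).
    k0 = next(k for k in range(1, N + 1) if n * x[k - 1] >= tails[k - 1])
    k_star = min(k0, N - n + 1)
    x_bar = tails[k_star - 1] / (N - k_star + 1)
    pi = [
        n * x[k - 1] / total if k < k_star else n * x_bar / total
        for k in range(1, N + 1)
    ]
    return k_star, pi
```

For the toy frame x = (5, 4, 3, 2, 1, 1) with n = 2, the rule gives k* = 4, and the inclusion probabilities sum to n, as they must for a fixed-size design.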

The π estimator for the population total can be computed as

$$\hat{t}_{\pi,SUN1} = \sum_{k=1}^{N} \frac{y_k}{\pi_k} I_k, \qquad (1)$$

where $I_k$ is the sample inclusion indicator function. The
variance is obtained as

$$Var(\hat{t}_{\pi,SUN1}) = -\frac{1}{2} \sum_{k=1}^{N} \sum_{l=1}^{N} Cov(I_k, I_l) \left( \frac{y_k}{\pi_k} - \frac{y_l}{\pi_l} \right)^2, \qquad (2)$$

which is the Yates-Grundy form with $Cov(I_k, I_l) = \pi_{kl} - \pi_k \pi_l$
(Särndal et al. 1992). We use the notation
$VAR_{SUN1}$ for (2) subsequently.
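
As a check on the Yates-Grundy form (2), the double sum can be evaluated directly once the first- and second-order inclusion probabilities are available. The sketch below is illustrative only: the function name is an assumption, and the joint probabilities are supplied as a full N × N nested list whose diagonal is ignored.

```python
def yates_grundy_variance(y, pi, pi_joint):
    """Variance of the pi estimator of a total for a fixed-size design.

    y        : population values y_k
    pi       : first-order inclusion probabilities pi_k
    pi_joint : N x N nested list of joint probabilities pi_kl
    """
    N = len(y)
    total = 0.0
    for k in range(N):
        for l in range(N):
            if k == l:
                continue  # the squared difference vanishes on the diagonal
            cov = pi_joint[k][l] - pi[k] * pi[l]
            diff = y[k] / pi[k] - y[l] / pi[l]
            total += cov * diff ** 2
    return -0.5 * total
```

Under simple random sampling without replacement the result reduces to the familiar N²(1 − n/N)S²/n; for y = (1, ..., 5) and n = 2 both expressions give 18.75.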


2.2 Sunter’s Variant 2 


In Sunter (1986, 1989) an exact πps design is described
for samples of size n > 2. To fix ideas let $z_k = x_k / T_N$
and order the population such that


ee aN IA) 


(17 — k)z} < Z,l2k=WN—n, 


where $Z_k = \sum_{j=k}^{N} z_j$. Let $m_k$ denote the number of samples
out of n still to be drawn when arriving at the k-th population
element $u_k$. Given that the two conditions are met,
the following algorithm selects an exact πps sample. For
$u_k$, $P(u_k \mid m_k) = m_k z_k / Z_k$ until $m_k = 0$ or $m_k = N - k$;
in the latter case discard one of the remaining units with
probability $1 - (m_k z_k / Z_k)$ and retain the others.

It is not always possible to order the population such
that the above conditions are met. Sunter (1986) describes
an algorithm that checks whether the ordering is possible.
The inclusion probabilities are 


Te = Neg 
(3) 
aren Liz yew SN — mn — Lil > k, 
where 
Va oe EAC a) 
LZK+1 Z) 2 


Ko" 2s ee N (late) 

The remaining second-order inclusion probabilities,
namely $\pi_{kl}$ for l > k > N − n, have to be obtained from
enumeration of possible samples, which is likely to be
infeasible. Sunter argues that (3) gives a good approximation
for those pairs of elements, and this approximation
has been used here. With these inclusion probabilities,
$\hat{t}_{\pi,SUN2}$ is given by the right-hand side (rhs) of (1). An
approximation to $Var(\hat{t}_{\pi,SUN2})$ is given by (2), wherein (3)
is used to obtain $\pi_{kl}$ for l > k > N − n.

The differences between SUN1 and SUN2 are noteworthy.
With SUN1 the joint inclusion probabilities are
computed exactly for all pairs, but the selection is not
genuine πps because of the introduction of SRS in part.
In Sunter's variant 2 the selection is exactly πps, but
$Var(\hat{t}_{\pi,SUN2})$ can only be approximated. We use $VAR_{SUN2}$
to denote this approximation.


2.3. RHC Design 


A description of the RHC design is straightforward;
properties of the RHC estimator are well documented in
Rao, Hartley and Cochran (1962), and Rao (1966, 1978).
After fixing the sample size n, the universe of size N is
randomly divided into n groups of size $N_i$, where $N = \sum_{i=1}^{n} N_i$
(i = 1, ..., n). Let $X_{ik}$ denote auxiliary information
for element $u_k$ in group i, k = 1, ..., $N_i$, and put
$X_i = \sum_{k=1}^{N_i} X_{ik}$. From each group one element is selected with
selection probability $p_{ik} = X_{ik} / X_i$. The estimator for the
total in group i is given as

$$\hat{t}_i = \sum_{k=1}^{N_i} \frac{y_{ik}}{p_{ik}} I_{ik},$$

where $I_{ik}$ is the sample inclusion indicator function for
element $u_k$ in group i. The population total is then
estimated by

$$\hat{t}_{RHC} = \sum_{i=1}^{n} \hat{t}_i \qquad (4)$$

with variance

$$Var(\hat{t}_{RHC}) = \frac{\sum_{i=1}^{n} N_i^2 - N}{N(N - 1)} \left( \sum_{k=1}^{N} \frac{T_N y_k^2}{x_k} - t^2 \right). \qquad (5)$$

Note that (5) depends on the group sizes and is minimized
when all are equal. In our application, we determined
$N_i$ such that some groups were of size $N_i = [N/n]_{gif}$,
where gif denotes the greatest integer function, and the
remainder of size $N_i = [N/n]_{gif} + 1$. The number of
groups of each size is chosen so that the sum of the group
sizes is N. If N/n is an integer, all groups are of course of
equal size. We denote (5) by $VAR_{RHC}$ in the sequel.

The RHC design is not an exact πps design, since the
subdivision of the population introduces a source of
randomness unrelated to the size of the auxiliary variable
and (4) is not a Horvitz-Thompson estimator. The inclusion
probability depends jointly on the size of $X_{ik}$ and on
the probability of an element being assigned to group i.
Ordering of the population has no effect on $VAR_{RHC}$.
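
The group-size rule and the variance of the random group estimator can be sketched as below. This is a hedged illustration: it assumes the standard Rao-Hartley-Cochran variance, a factor depending only on the group sizes times the with-replacement spread term Σ y_k²/p_k − t² with p_k = x_k/T_N; the function names are hypothetical.

```python
def rhc_group_sizes(N, n):
    """Split N elements into n groups whose sizes differ by at most one."""
    base, extra = divmod(N, n)          # base = greatest integer in N/n
    return [base] * (n - extra) + [base + 1] * extra


def rhc_variance(y, x, n):
    """Variance of the RHC estimator of the total of y (illustrative)."""
    N = len(y)
    sizes = rhc_group_sizes(N, n)
    total_x = sum(x)                    # T_N
    t = sum(y)                          # population total of y
    factor = (sum(s * s for s in sizes) - N) / (N * (N - 1))
    # sum_k y_k^2 / p_k with p_k = x_k / T_N, minus t^2
    spread = sum(yk * yk * total_x / xk for yk, xk in zip(y, x)) - t * t
    return factor * spread
```

Two checks follow from the formula: the variance is zero when y is exactly proportional to x, and with a constant x and equal group sizes it coincides with the variance under simple random sampling without replacement.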
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3. TREE POPULATIONS 


Table 1 shows the tree populations under consideration
and Figure 1 displays the relationship between the various
choices for $x_k$ and the target characteristic for the yellow
poplar population. We notice almost perfect proportionality
between $d^2h$ and volume; the relationship between
d and volume is clearly curvilinear, and the relationship
between $d^2$ and volume is intermediate. No noticeable
trend between sequence number and volume is apparent
in the unordered sampling frame. For the remaining three
populations similar patterns hold.

For the four populations and the various combinations
of auxiliary variable and sampling intensity, there were no
observations for which $n x_k > T_N$; thus no records were
measured with certainty.


Figure 1. Relation of bole volume (cu ft) to bole dimensions in yellow poplar: (a) diameter at breast height; (b) diameter squared; (c) squared
diameter times height (cu ft); (d) tree sequence number.
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Table 1 
Tree Populations Examined in an Empirical Comparison of SUN1, SUN2, and RHC
The Last Four Columns Contain Pearson Correlation Coefficients $\rho(y;x)$ Between $x_k$ and $y_k$

                                                                      $\rho(y;x)$
Species                                    N(1)    t(2) (ft³)     d      $d^2$   $d^2h$    No
Ponderosa pine   Pinus ponderosa            140     9,366.6     0.99    0.99    0.99     0.31
Yellow poplar    Liriodendron tulipifera    336     1S255.5     0.96    0.96    0.99    -0.07
Loblolly pine    Pinus taeda                437     1,835.8     0.96    0.96    0.99    -0.32
Red pine         Pinus resinosa              91     4,075.7     0.96    0.96    0.97    -0.05


(1) N is the number of trees in the population.
(2) t is total volume.

4. RESULTS 


4.1 Comparison of Variances 


The variances of the estimators of t corresponding to the
SUN1, SUN2, and RHC designs, expressed as a proportion
of the MSE under the ROM strategy, are compared in
Table 2 for the yellow poplar population for each of the
sampling intensities investigated, and Table 3 depicts
pertinent results for the remaining populations. For the
SUN1 strategy, the populations were ordered by decreasing
size of $x_k$, as recommended by Sunter (1977a, 1977b). We
focus initially on the results for the yellow poplar population
in Table 2.


Table 2 


Relative Performances of SUN1, SUN2 and RHC Design
for the Yellow Poplar Population where
Ratio-of-means Estimation (ROM) Serves as a Benchmark

n/N%    X        n    VAR_SUN2/MSE_ROM   VAR_SUN1/MSE_ROM   VAR_RHC/MSE_ROM   k*(1)
 1      No       4        4.8120             3.3136             4.7767
 1      d        4        0.6135             0.6684             0.6731         332
 1      $d^2$    4        0.4605             0.4596             0.4613         333
 1      $d^2h$   4        0.3361             0.3378             0.3402         330
 2      No       7        5.1307             2.6346             5.0568
 2      d        7        0.7090             0.6982             0.7081         325
 2      $d^2$    7      [illegible]          0.5694             0.5751         318
 2      $d^2h$   7        0.4263             0.4542             0.4369         316
 5      No      17        5.4938             1.6643             5.2793
 5      d       17        0.7305             0.7808             0.7283         309
 5      $d^2$   17        0.6541             0.6992             0.6608         291
 5      $d^2h$  17        0.4603             1.2638             0.4935         285
10      No      34        5.8326             1.0985             5.3594
10      d       34        0.7385             0.7083             0.7339         247
10      $d^2$   34        0.6712             0.9687             0.6864         260
10      $d^2h$  34        0.4298             3.0140             0.5037         250


(1) k* is the observation in the ordered sampling frame at which the SUN1
design switches from πps to SRS sampling.


For a given sampling intensity the precision of all
designs relative to ROM increases in the order X = No,
$d$, $d^2$, $d^2h$; i.e., with increasing proportionality between
auxiliary variable and tree bole volume. Given that the
approximation of the variance of SUN2 performs well,
$VAR_{SUN2}$ can be regarded as measuring the closeness of
the RHC and SUN1 designs to matching the efficiency of
a genuine πps selection. At low sampling intensities and
with meaningful auxiliary information the two designs do
not deviate much from SUN2. The performance of both
RHC and SUN1 appears to deteriorate at higher sampling
intensities relative to SUN2, depending on the choice of
size measure. For $X = d^2h$, in which case $\rho(y;x) = 0.99$
(see Table 1), RHC is still .85 (.4298/.5037) as efficient as
SUN2 but SUN1 is only .14 (.4298/3.014) as efficient
when n/N% = 10. The performance of RHC and SUN1
relative to SUN2 improves for other choices of X which
are less well correlated with Y. Indeed, when X = No,
SUN1 is much more efficient than SUN2.
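
The efficiency figures quoted above are simple ratios of Table 2 entries. The short sketch below reproduces them for yellow poplar with the squared-diameter-times-height size measure at the 10% sampling intensity; the dictionary and function names are illustrative only.

```python
# Variance ratios relative to MSE_ROM, taken from Table 2
# (yellow poplar, X = squared diameter times height, n/N% = 10).
ratio_to_rom = {"SUN2": 0.4298, "SUN1": 3.0140, "RHC": 0.5037}


def relative_efficiency(design, reference="SUN2"):
    """Efficiency of `design` relative to `reference` (a variance ratio)."""
    return ratio_to_rom[reference] / ratio_to_rom[design]


print(round(relative_efficiency("RHC"), 2))   # efficiency of RHC relative to SUN2
print(round(relative_efficiency("SUN1"), 2))  # efficiency of SUN1 relative to SUN2
```

Because each entry is already scaled by the same benchmark MSE, that benchmark cancels in the ratio and the result compares the two design variances directly.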

A puzzling aspect of these results is the indication that
SUN2 is less efficient than either RHC or SUN1 for some
choices of auxiliary variable and sampling intensity. We
speculate that it may be an artifact of the approximation
of some second-order inclusion probabilities incorporated
into $VAR_{SUN2}$. It also may depend on the particular
ordering used in SUN1 or the group sizes used in RHC
sampling, respectively. It is feasible to calculate the exact
$Var(\hat{t}_{\pi,SUN2})$ for n = 2. We did so for the ponderosa
pine and the red pine populations. The results indicate
that $VAR_{SUN2}$ approximates the precision of the SUN2
design very well, but is slightly conservative. The ratios
$Var(\hat{t}_{\pi,SUN2}) / VAR_{SUN2}$ took on values between 0.975 and
0.999. For larger sample sizes there is no feasible way to
determine how well the approximation $VAR_{SUN2}$ performs.

We focus now on the comparison of RHC to SUN1,
again with reference to Table 2. At low sampling intensities,
$VAR_{SUN1}$ and $VAR_{RHC}$ are essentially equivalent
when $X = d^2h$. But using this auxiliary variable at higher
intensities led to a substantially better performance of $\hat{t}_{RHC}$
in some cases. The most noteworthy case is n/N% = 10,
where $\hat{t}_{RHC}$ is nearly 6 times more precise than $\hat{t}_{\pi,SUN1}$.

We surmise from these results that the better $x_k \propto y_k$
holds, the better is the precision of $\hat{t}_{RHC}$ relative to $\hat{t}_{\pi,SUN1}$,
owing chiefly to the effect of k* on $VAR_{SUN1}$. Small values
of k* indicate an early switch to an SRS selection and
coincide with small values of $VAR_{SUN2} / VAR_{SUN1}$. Large
values of k* on the other hand correspond to variance
ratios close to 1. For yellow poplar, n/N% = 10 and
$X = d^2h$, the SUN1 design selects only three-fourths of
the population according to a πps design; we conjecture
that the early transition to SRS serves also as an explanation
for its poor performance compared to the RHC
design. When X = tree sequence number, SUN1 is much
more precise than RHC, and its relative precision increases
as n increases.


The sharp improvement in efficiency when using an 
auxiliary variable other than tree sequence number provides 
an indication of the effectiveness of the strategies discussed 
here when X is positively correlated to Y, and to the 
liability of sampling with probability proportional to an 
auxiliary variable when it is unrelated to Y. 


The pattern evident in the results for yellow poplar is
generally seen, also, in the results for the other species.
Some of them are summarized in Table 3. For ponderosa
pine, SUN1 is always less precise than RHC when
$X = d^2h$, regardless of the sampling intensity, and SUN2
always performs best when this variable is used. For all
species the combination n/N% = 10, $X = d^2h$ leads to
low precision of SUN1 compared to the other designs and,
with the exception of the loblolly pine population, SUN1
performs worse than ratio-of-means estimation. For all
populations, the order of magnitude better precision of
ROM over the genuine πps, non-IPPS or approximate πps
design when X = tree sequence number is remarkable.


From Figure 1 it can be seen that the ordering of volume
by tree numbers is haphazard, i.e., the sequence number
carries no information about bole volume. There is
a price to pay if one uses this uninformative auxiliary
information to determine inclusion probabilities. The
inefficiency of unequal probability sampling in the presence
of uninformative auxiliary information is an important
limitation for the simultaneous estimation of multiple
population attributes, where some may be closely related
to the auxiliary design variable but others might be uncorrelated
with it. Rao (1966) discusses this point in detail and
proposes alternative estimators based on the unbiased
estimators in equal probability sampling, among them the estimator
$\hat{t}_{RHC(alt)} = N \sum_{i=1}^{n} y_i \xi_i$, where $\xi_i = \sum_{k} p_k$ is summed over the elements of group i in the RHC design.
Applying this estimator in the case of unequal probability
sampling leads to bias, but to better mean square error
performance. For the RHC design with X = tree sequence
number, the alternative estimator proposed by Rao (1966)
improved the ratio $MSE_{RHC(alt)} / MSE_{ROM}$ remarkably. For
the yellow poplar population, for example, these ratios were
between 1.34 (n = 4) and 2.58 (n = 34), corresponding
to a mean square error of the alternative estimator of only
28% (n = 4) to 48% (n = 34) of the variance of the RHC estimator (5). Similar
patterns hold for the other tree species.


Since the alternative estimator is inconsistent (its bias 
does not depend on n), the larger ratios within the range 
for each species appear for larger sample sizes. It thus 
seems reasonable to limit the use of this estimator to 
smaller sample sizes. When n gets larger, another alternative 
is to use a ratio estimator, e.g., Hájek’s estimator 
$N \{ (\sum y_i/\pi_i) / (\sum 1/\pi_i) \}$ under a genuine πps design. 
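
Hájek’s ratio estimator mentioned above can be written in a few lines. The sketch below assumes that the inclusion probabilities π_i of the sampled units are already known; the function name and argument names are hypothetical.

```python
def hajek_total(y_sample, pi_sample, N):
    """Hajek-type estimator of a population total.

    y_sample  : observed values y_i of the sampled units
    pi_sample : their inclusion probabilities pi_i (assumed known, all > 0)
    N         : population size
    """
    weighted_sum = sum(y / p for y, p in zip(y_sample, pi_sample))
    weight_sum = sum(1.0 / p for p in pi_sample)
    # N times the ratio of the two weighted sums
    return N * weighted_sum / weight_sum
```

Because numerator and denominator use the same weights, the estimate equals N times the common value whenever all sampled y_i are equal, whatever the π_i are. With equal π_i it reduces to N times the sample mean.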


Table 3 

Pertinent Results About the Relative Performances of 
SUN1, SUN2 and RHC Design for the Remaining 
Populations where Ratio-of-means Estimation (ROM) 
Serves as a Benchmark 

n/N%   X     n    VAR_SUN2      VAR_SUN1   VAR_RHC 

Ponderosa Pine 
  1    No.    2   1.9608        1.9794     1.9507 
  1    d²h    2   0.1056        0.1096     0.1077        [illegible] 
  2    No.    3   [illegible]   1.9264     [illegible] 
  2    d²h    3   [illegible]   0.1919     0.1859        [illegible] 
  5    No.    7   [illegible]   2.0681     2.7819 
  5    d²h    7   [illegible]   0.3890     0.3670        129 
 10    No.   14   [illegible]   2.2745     3.0294 
 10    d²h   14   [illegible]   1.3724     0.4488        97 

Red Pine¹ 
  2    No.    2   [illegible]   1.9485     2.0029 
  2    d²h    2   0.9076        0.9026     0.9104        90 
  5    No.    5   2.9295        2.3141     2.8236 
  5    d²h    5   [illegible]   1.3456     0.8991        87 
 10    No.    9   [illegible]   2.0124     3.2958 
 10    d²h    9   0.8699        1.3192     0.8942        81 

Loblolly Pine 
  1    No.    5   4.8011        3.7104     4.7625 
  1    d²h    5   0.4043        0.4161     0.4174        431 
  2    No.    9   5.5940        3.7441     5.5044 
  2    d²h    9   [illegible]   0.5510     0.5476        419 
  5    No.   22   6.5290        3.3082     6.5253 
  5    d²h   22   [illegible]   0.6385     0.6085        406 
 10    No.   44   7.7977        2.6635     6.5708 
 10    d²h   44   0.3854        0.7214     0.6146        375 


1 The sampling intensity 1% was omitted since it would have resulted 
ey eae 
4.2 The Effect of Ordering on the Precision of Sunter’s Variant 1 


Sunter and others have noted that the precision of the 
SUN1 design depends on the ordering of the population. 
The recommendation to sort the sampling frame by decreasing 
size of the x_i is rooted in the assumption that 
larger x_i are more likely to be proportional to y_i than 
smaller ones. The goal is to apply the πps part of the SUN1 
design not only to as big a portion of the population as 
possible but also to those elements for which x_i ∝ y_i 
holds best. Under this assumption it was thus advised to 
put the elements with large x_i values at the top of the 
frame. However, it is clear that this is only a rough rule 
of thumb, since the assumption of greater proportionality 
with increasing size may not hold. 

To investigate the effect of ordering, the ponderosa pine 
and red pine populations were first sorted by increasing 
x_i and then grouped into 10 groups of approximately 
equal size. The Pearson correlation coefficient ρ between 
x_i and y_i was computed within each group and the 
populations were then sorted by 

(a) groups of decreasing correlation and increasing size of 
x_i within each group, 

(b) groups of decreasing correlation and decreasing size 
of x_i within each group 

and SUN1 sampling was repeated for each choice of 
auxiliary variable at a sampling intensity of 10%. Table 4 
shows the results. 
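
The reordering procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: group boundaries are set by integer division, a group with no variation in x or y is given correlation 0, and the function names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    if s_aa == 0 or s_bb == 0:
        return 0.0  # assumption of this sketch: undefined correlation counts as 0
    return s_ab / (s_aa * s_bb) ** 0.5


def reorder_by_group_correlation(x, y, n_groups=10, decreasing_x=False):
    """Return unit indices in the frame order of option (a) or (b).

    decreasing_x=False gives option (a), increasing x within each group;
    decreasing_x=True gives option (b), decreasing x within each group.
    """
    order = sorted(range(len(x)), key=lambda t: x[t])  # sort by increasing x
    N = len(order)
    bounds = [g * N // n_groups for g in range(n_groups + 1)]
    groups = [order[bounds[g]:bounds[g + 1]] for g in range(n_groups)]
    scored = []
    for group in groups:
        if not group:
            continue  # can only happen when N < n_groups
        rho = pearson([x[t] for t in group], [y[t] for t in group])
        members = sorted(group, key=lambda t: x[t], reverse=decreasing_x)
        scored.append((rho, members))
    # groups with the highest within-group correlation go to the top of the frame
    scored.sort(key=lambda item: item[0], reverse=True)
    return [t for _, members in scored for t in members]
```

The returned index list would then be used to rearrange the sampling frame before SUN1 selection is applied.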


Table 4 

VAR_SUN1/MSE_ROM for Ponderosa Pine and Red Pine 
and Different Ways of Ordering the Population 

        Ponderosa Pine Ordered by        Red Pine Ordered by 

                  decr. ρ   decr. ρ                decr. ρ   decr. ρ 
X       decr. x   incr. x   decr. x      decr. x   incr. x   decr. x 

d       0.5614    0.6165    0.6043       1.0307    1.0236    0.6454 
d²      0.3478    0.6562    0.5869       1.2077    0.9373    0.6948 
d²h     1.3724    60.861    0.4459       1.3192    0.8674    0.7461 


The results are rather surprising. For red pine the order 
by decreasing correlation improved all measures of precision. 
Sorting by increasing x_i within each group now made 
VAR_SUN1 very close to VAR_RHC, and with X = d²h, 
VAR_SUN1 < VAR_RHC. Sorting by decreasing x_i within 
each group achieved an even greater improvement. In contrast 
to these results, sorting the ponderosa pine population 
by decreasing ρ and increasing x_i made things worse. 
The very high value of 60.861 is caused by a premature 
switch to SRS, since in this setting k* is only 28, corresponding 
to only 20% of the population being sampled πps. 
Moreover, using the order of decreasing ρ and decreasing x_i 
improved VAR_SUN1 only for X = d²h. 

These results indicate that there may exist an order that 
minimizes VAR_SUN1 and may yield higher precision than 
a simple ordering by decreasing value of X. But this order 
will usually differ depending upon the auxiliary information, 
and even an ordering that is reasonable on intuitive 
grounds may give unanticipated results. It is not known 
if any ordering is optimal in the sense of minimizing 
Var($\hat{t}_{SUN1}$) for the approximate πps design used in this 
study. To our present knowledge, no optimal 
strategy has been described. 


5. DISCUSSION AND CONCLUSION 


Employing some meaningful auxiliary information leads 
to a considerable gain in precision in the unequal probability 
designs compared to ratio-of-means estimation. 

A choice between the two Sunter designs can be made 
on grounds of the relationship between size measure and 
target characteristic. When the relationship X ∝ Y is strong, 
SUN2 offers an advantage over SUN1, and SUN1 appears 
preferable when the relationship is weak. Based on our results, 
the approximate πps strategy SUN1 and the non-IPPS design RHC 
appear to come fairly close to the efficiency offered by 
genuine πps selection. With increasing sampling intensity, 
however, the highest precision is obtained with the SUN2 
design. But the quality of the approximation VAR_SUN2 in 
this case is unclear. 

If one’s aim is to use an approximate πps or a non-IPPS 
strategy then the RHC design with estimator $\hat{t}_{RHC}$ appears 
to offer advantages over the Sunter design with $\hat{t}_{SUN1}$, at 
least for the tree populations studied here with the objective 
of estimating total bole volume. At reasonably low 
sampling intensities, both estimators appear to be equally 
precise. 

An advantage of the RHC design is its simplicity. An 
operational advantage is that it can be applied to every 
population because it is impervious to its ordering and 
provides an unbiased estimation within each group. While 
the first criterion is also met by Sunter’s variant 1, the 
ordering there clearly affects the precision of the estimator 
$\hat{t}_{SUN1}$. Variant 2 can only be used if some ordering of the 
population meets the conditions given in Section 2.2. 
Otherwise the selection algorithm does not produce a 
sample of exactly size n. 

The precision of the RHC method, however, depends 
on the group sizes employed. The algorithm given in 
Section 2.3 is optimal. 

While a particular ordering may improve the precision 
of $\hat{t}_{SUN1}$, it is unclear at present how to discern an optimal 
ordering and a fixed sample size. Moreover, an optimal 
ordering for one choice of auxiliary variable or attribute 
of interest may be deleterious when implemented with a 
different auxiliary variable or attribute. 

All strategies can be disastrous with uninformative 
auxiliary information. 

Finally, and to the extent that computational burden is 
a meaningful criterion, RHC is arguably less burdensome 
than variant 1 of Sunter’s design. 